Friday, May 07, 2010

Digital Preservation Matters - May 7, 2010

The Library of Congress Unlocks The Ultimate Archive System. Ken Weissman. Creative COW Magazine.  7 May 2010.
Article about the Library of Congress' Film Preservation Laboratory, where they are working on the ultimate archive system, starting with the restoration of films first printed on paper instead of film. One of the Library's main missions is to preserve America's memory for future generations, with no end point. They have a collection of 6.3 million audio-visual and film materials, and a pilot project digitizing their paper print collection. They are designing the workflow around the concept of a Preservation Index (the preservation quality of a storage environment). The plan for now is to "scan the images, restore or preserve them as needed, then run them back to film, and put the film away at 25 degrees, 30% relative humidity, for practically forever. For most people, in practice, somewhere between 600 and 2000 years is beyond forever."


Zettabytes overtake petabytes as largest unit of digital measurement. Heidi Blake.  04 May 2010.
IDC, the technology consultancy, has released its annual survey of the world's digital output. Humanity's total digital output currently stands at around 800,000 petabytes (0.8 zettabytes) but is expected to pass 1.2 zettabytes this year. The rapid growth of the "digital universe" has been driven by the explosion of social networking, online video, digital photography and mobile phones. Around 70% of the world's digital content is created by individuals, but it is stored on content-sharing websites such as Flickr and YouTube. In 2007, they estimated that the digital universe was equivalent to 161,000 petabytes. [The term exabyte, which is larger than a petabyte, was used previously.] They estimate the digital universe will expand by a factor of 44 over the next decade. Read more about Sortabytes, Peptabytes, and Lumabytes.
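For readers keeping the prefixes straight, the relationships between these decimal (SI) units can be sketched in a few lines of Python; this is a minimal illustration, not something from the article:

```python
# Decimal (SI) storage units, as used in "digital universe" estimates:
# 1 zettabyte = 1,000 exabytes = 1,000,000 petabytes.
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]

def to_bytes(value, unit):
    """Convert a value in the given SI unit to bytes (1 kilobyte = 10**3 bytes)."""
    return value * 10 ** (3 * UNITS.index(unit))

# 1.2 zettabytes expressed in petabytes: 1.2 million.
zb_in_pb = to_bytes(1.2, "zettabyte") / to_bytes(1, "petabyte")
print(f"{zb_in_pb:,.0f} petabytes")
```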


The Sun-Times Preserves Its Photo Archive by Selling It. Michael Miner. Chicago Reader. May 6, 2010.
John Rogers claims to have the world's largest private collection of vintage photos, about 30 million images. The Sun-Times has sold its archive of more than a million photos and negatives to Rogers, though they retain the intellectual rights, and Rogers is obliged to re-create the "entire library in digital searchable form". He is doing for the Sun-Times what they wanted to do but couldn't afford; they estimated the processing would have taken many years and millions of dollars. Once the photo archive is digitized, the Sun-Times will be able to tap a growing "aftermarket" for copies of old news photos. Rogers can digitize 200,000 images a month and hopes to reach 400,000. The creation of metadata is the expensive part.

The Rocky Mountain News is a good example of what can happen when a newspaper folds. The paper went out of business in February 2009. "All those photos were given to the Denver Public Library and are sitting in a basement in storage. The library can't sell them to me, and they don't have the money to digitize them. So they'll stay in the basement. I spoke to a very nice lady at the library. I said, 'Can they be accessed by the public?' She said, 'Not at this time.' 'Will they ever be digitized?' 'We don't have the funds to do it.'" Instead, he bought the Denver Post archive, so he considers the Rocky Mountain News pictures redundant.


New program helps secure data. Science Alert. 28 April 2010.
Researchers at Monash University have developed the MyTARDIS/TARDIS program to give researchers a place to securely store research information. It also has the ability to share the most complex of scientific data through the internet. "The program records the data generated from an experiment, catalogues it, making it searchable, and transfers it back to the home institution, where the researcher can analyse the data using MyTARDIS, then make it publicly available on the TARDIS system alongside publication of the results in a scientific journal." [What better name?] The software has created a central place where researchers can exchange information rapidly and securely.

"Link Rot" & Legal Resources on the Web: A 2010 Analysis.  Sarah Rhodes. Legal Information Archive. May 2010.
The Legal Information Archive site has information about the Chesapeake Project, which contains government, policy, and legal information archived by several law libraries. 

This particular article presents their third annual analysis of link rot among the original URLs for archived materials. The term "link rot" refers to a URL that no longer points to the resource it originally did; the link may return a "not found" message, or may point to a different resource. In their evaluation of 1,266 born-digital online titles that were harvested, link rot was found to be present in:
  • 2008:   48 of 579 URLs,  8.3 %
  • 2009:   83 of 579 URLs, 14.3 %
  • 2010: 160 of 579 URLs, 27.9 %
A results table shows that over 90% of the URLs in the sample had state-government, .org, or .gov top-level domains.
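A basic link-rot check of this kind can be sketched in a few lines of Python. This is a hedged illustration, not the Chesapeake Project's actual methodology: the URLs below are placeholders, and a real audit would also have to catch pages that still resolve but now hold different content.

```python
import urllib.request

def is_rotten(url, timeout=10):
    """Return True if the URL no longer resolves (HTTP error or network failure)."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return False
    except (OSError, ValueError):  # URLError/HTTPError are OSError subclasses
        return True

sample = ["http://example.com/", "http://example.com/a-moved-report.pdf"]
rotten = [url for url in sample if is_rotten(url)]
print(f"link rot: {len(rotten)} of {len(sample)} URLs ({len(rotten) / len(sample):.1%})")
```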


Imation launches broad line of secure removable storage devices.  Lucas Mearian. ComputerWorld.  May 3, 2010.
Imation announced a new line of products: four flash drives, two hard drives, a line of Blu-ray optical discs, and removable tape cartridges, all with a range of encryption and security management tools.


shortDOI™ Service.  Web site. International DOI Foundation.  May 6, 2010.
A new service is available to shorten DOI names, which are often very long strings; the service creates short handles for them. A DOI (Digital Object Identifier) is a persistent name that can be given to internet resources instead of a URL, which can change. The shortDOI service returns a shortcut that resolves to the same object as the long DOI form.
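As a sketch of how resolution works: both the long name and the short handle are simply names handed to the public doi.org resolver, which redirects to the current location of the object. The DOI strings below are made-up placeholders, and this is not the shortDOI service's own API.

```python
import urllib.request

def doi_url(doi):
    """Build the public resolver URL for a DOI name or shortDOI handle."""
    return f"https://doi.org/{doi}"

def resolve(doi):
    """Follow the resolver's redirect and return the final URL (network required)."""
    with urllib.request.urlopen(doi_url(doi)) as response:
        return response.url

# A long DOI name and its shortDOI handle resolve to the same object, e.g.:
#   resolve("10.1000/journal.article.12345") == resolve("10/abc12")
print(doi_url("10/abc12"))
```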


Friday, April 30, 2010

Digital Preservation Matters - April 30, 2010

Digital Preservation: An Unsolved Problem. Jonathan Shaw. Harvard Magazine. April 27, 2010.

With the advantages of digital, why do libraries not embrace the digital future now? One of the main obstacles is the issue of preservation. For books: "the greatest risks to printed material are the environment, wear and tear, security, and custodial neglect." For digital: using data is one of the best ways to preserve it because you know it is usable; digital data must be read and checked constantly to ensure integrity. Another concern about digital is that current formats may not be readable in the future (reference to June 2009 New Yorker cover). Born digital materials are not as easy to save since they have many different formats. This is difficult for librarians keeping records of the university's intellectual life, because of both the legal and digital challenges. "We are in a period of unprecedented lack of documentation of academic output."


Gutenberg 2.0. Harvard's libraries deal with disruptive change. Jonathan Shaw. Harvard Magazine. April 27, 2010.

In the scientific disciplines, information, from online journals to databases, must be recent to be relevant. Books in libraries to some seem more like a museum. Some think that massive digital projects will make research libraries irrelevant. The future of libraries is clearly digital. "Yet if the format of the future is digital, the content remains data. And at its simplest, scholarship in any discipline is about gaining access to information and knowledge." Access to the information will mean different things and be done in different ways. In the meantime, "Who has the most scientific knowledge of large-scale organization, collection, and access to information? Librarians."

How do we deal with large scale collections and the access to the information? "We ought to be leveraging that expertise to deal with this new digital environment. That's a vision of librarians as specialists in organizing and accessing and preserving information in multiple media forms, rather than as curators of collections of books, maps, or posters." The role of libraries isn't going away, but it is changing.

The idea that libraries will be stewards of vast data collections raises very serious concerns about the long-term preservation of digital materials. The worry is that the longevity of the resources has not been tested. There are three copies of the 109 TB Harvard repository, which is in a constant process of checking and refreshing to make sure everything is readable.
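That checking-and-refreshing cycle rests on fixity information: a stored digest for every file, periodically recomputed and compared. A minimal sketch of the general technique in Python (an illustration, not Harvard's actual system):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def failed_fixity(manifest):
    """Return the paths whose current digest no longer matches the manifest."""
    return [path for path, expected in manifest.items()
            if sha256_of(path) != expected]
```

In practice a repository would run checks like this continuously against each of its copies, and repair a failing file from one of the copies that still verifies.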


The Floppy is Dead: Time to Move Memories to the Cloud. Lance Ulanoff. PC Magazine. Apr 26, 2010.

The decision by Sony to stop producing 3.5-inch disks marks an end to that format. The end of any popular format can have a ripple effect on the technology world: if the data is not migrated to later formats, it could be "trapped on its obsolete format". All media will become obsolete sometime; it is the natural progression of technology. Since change is inevitable, the article suggests everyone consider cloud-based backup storage options, arguing that this is better than storing data on eventually-to-be-obsolete media.


Google is not the last word in information. Lia Timson. Sydney Morning Herald. April 29, 2010.

Interesting article concerning primary and secondary sources, what is on the internet and how it gets there, special collections, etc.

  • "Better still is the lesson and the realisation that information and history don't just appear on Google. Someone has to publish it onto the web, put it there in the first place."
  • "As educators we must ask that assignment bibliographies include more than just "three websites". We must insist on a variety of media as sources, including interviews with real people, be they witnesses, historians or surviving relatives, and even insist on trips to the local library."
  • … researching is much wider and deeper than searching online.


A Gentle Reminder to Special-Collections Curators. Todd Gilman. The Chronicle of Higher Education. April 29, 2010.

Article about a librarian's experience trying to use special collections. The "job is not to keep readers from your books but just the opposite: to facilitate readers' use of the collections."


Friday, April 23, 2010

Digital Preservation Matters - April 23, 2010

National Archives Reports on Federal Agency Records Management Programs. NARA Press Release. April 19, 2010.

NARA issued a mandatory records management self-assessment to 245 Federal cabinet-level agencies and related groups, and 91% responded. The goal was to determine how effective Federal agencies are in meeting the statutory and regulatory requirements for records management. The study showed that 79% of agencies are falling short in their responsibilities. The long-term success of the Open Government initiative, and the ability to ensure access to the records of our government, hinge on the ability of each Federal agency to effectively manage its records.

View the 93-page report.


Library of Congress Digital Preservation Newsletter. Library of Congress. April 2010.

The newsletter includes information about a number of digital preservation initiatives. Some of them are:

  • A new video "Why Digital Preservation is Important for Everyone" which also includes a transcript. The main theme is that digital materials, which can fail or be lost, require active management. The three minute video is worth watching.
  • The Federal Agencies Digitization Guidelines Initiative is helping government agencies preserve audio-visual information.
  • Links to The Blue Ribbon Task Force on Sustainable Digital Preservation and Access and their recent report, Sustainable Economics for a Digital Planet: Ensuring Long-Term Access to Digi­tal Information.
  • Link to a podcast "Conversations about Digital Preservation" about the Library's challenges to build an efficient, scalable digital repository, how the Library's repository works, and future plans for the repository.
  • A group of institutions have developed an automated way to preserve official e-mail records produced by Microsoft Outlook and capture the necessary long-term preservation metadata. This is part of the Persistent Digital Archives and Library System project (PeDALS) to develop a shared curatorial framework for preserving digital public records across multiple states.
  • May 10th will be the Personal Archiving Day at the Library of Congress.


NEW Blog from the DuraSpace Preservation & Archiving Solution Community. Carol Minton Morris. DuraSpace Preservation & Archiving. April 21, 2010.

A new blog has been set up by the Preservation and Archiving Solution Community. The blog is a vehicle for an open exchange of ideas and initiatives around preservation & archiving solutions. All are welcome to participate. It had started as a group using Fedora Commons, but is actually looking at all preservation issues, not just those for Fedora or DSpace.


Digital Preservation and the Challenge. Ron Jantz. DuraSpace Preservation & Archiving. April 21, 2010.

Institutions around the world are grappling with the technology, processes, and organizational structures that will make digital preservation a reality. The challenge to preserve information goes back centuries; the article mentions an example from when the Reformation dissolved the monasteries and the books were not preserved. Can we demonstrate that we are preserving what we have now? We should be looking at self-assessment tools to see how we are doing with preservation.


Crowdsourcing: How and Why Should Libraries Do It? Rose Holley. D-Lib Magazine. March/April 2010.

Crowdsourcing is a new term referring to undefined groups of people in a community "taking tasks traditionally performed by an employee or contractor and outsourcing it to a group (crowd) of people or community in the form of an open call." It may be the "most useful tool a library can have in the future." The work can be done as a group or as an individual. Libraries already know about the first step of crowdsourcing, social engagement with individuals, but need to improve in the second step, defining and working towards group goals. This can bring benefits to libraries and users, especially by adding value to data through comments, tags, ratings, and reviews. Some successful examples include collections at the National Library of Australia, FamilySearch Indexing (text transcription of records for the Latter-day Saints), Wikipedia, etc. These released their services 'quietly' with little or no advertising, but with clear group goals. The article looks at the volunteer profile, motivational factors, types of acknowledgement and rewards, managing volunteers, and tips for successful crowdsourcing. "Freedom is actually a bigger game than power. Power is about what you can control. Freedom is about what you can unleash".

Some of the tips:

  1. Have a transparent and clear goal on your home page
  2. Have a transparent and visible chart of progress towards your goal.
  3. Make the overall environment easy to use, intuitive, quick and reliable.
  4. Make the activity easy and fun; it must be interesting.
  5. Keep the site active by addition of new content/work.
  6. Give volunteers options and choices.
  7. Make the results/outcome of your work transparent and visible.
  8. Let volunteers identify and make themselves visible if they want acknowledgement.
  9. Reward high achievers by having ranking tables and encourage competition.
  10. Give the volunteers an online team/communication environment to build a dynamic, supportive team environment.
  11. Treat your 'super' volunteers with respect and listen to them carefully.
  12. Assume volunteers will do it right rather than wrong.


Friday, April 16, 2010

Digital Preservation Matters - April 16, 2010

State Of America's Libraries Report 2010. American Library Association. April 11, 2010.

Interesting report about libraries. As the recession continues, Americans turn to libraries in ever larger numbers for access to resources for employment, continuing education, and government services. The local library has become a lifeline of resources, training and workshops. Even in the age of Google, academic libraries are being used more than ever. During a typical week in fiscal 2008, academic libraries in the United States had more than 20.3 million visits, answered more than 1.1 million reference questions, and made more than 498,000 presentations to groups attended by more than 8.9 million students and faculty, increases over the previous years. Over 43% of libraries provide access to locally produced digitized collections.


A National Conversation on the Economic Sustainability of Digital Information. Blue Ribbon Task Force on Sustainable Digital Preservation and Access. April 1, 2010. [Silverlight video.]

This page has the agenda and video presentations from A National Conversation on the Economic Sustainability of Digital Information, a recent meeting hosted by the Blue Ribbon Task Force on Sustainable Digital Preservation and Access.

BRTF's Featured Agenda and Presentations:

  • Research Data, Daniel E. Atkins, Wayne Clough,
  • Scholarly Discourse, Derek Law, Brian Schottlaender,
  • Economics of Collectively-Created Content, George Oates, Timo Hannay
  • Commercially-owned Cultural Content, Chris Lacinak, Jon Landau
  • Economics of Digital Information, William G. Bowen, Hal R. Varian, Dan Rubinfeld
  • Summary by Clifford Lynch.


How Tweet It Is!: Library Acquires Entire Twitter Archive. Matt Raymond. Blog. Library of Congress. April 14, 2010.

The Library of Congress is digitally archiving every public tweet made since Twitter started in 2006. "Expect to see an emphasis on the scholarly and research implications of the acquisition." Amazing to think what we can "learn about ourselves and the world around us from this wealth of data. And I'm certain we'll learn things that none of us now can even possibly conceive." The Library of Congress has been archiving information from the web since 2000. It now has more than 167 terabytes of web-based information, including legal blogs and political websites.


Library of Congress: We're archiving every tweet ever made. Nate Anderson. Ars Technica. April 16, 2010.

Comments about the Library of Congress archiving tweets:

  • "There's been a turn toward historicism in academic circles over the last few decades, a turn that emphasizes not just official histories and novels but the diaries of women who never wrote for publication, or the oral histories of soldiers from the Civil War, or the letters written by a sawmill owner. The idea is to better understand the context of a time and place, to understand the way that all kinds of people thought and lived, and to get away from an older scholarship that privileged the productions of (usually) elite males."
  • "Digital technologies pose a problem for the Library and other archival institutions, though. By making data so easy to generate and then record, they push archives to think hard about their missions and adapt to new technical challenges."


Aligning Investments with the Digital Evolution: Results of 2009 Faculty Survey Released. Roger C. Schonfeld, Ross Housewright. Ithaka. April 07, 2010. [37p. PDF]

An excellent report for academic libraries especially, Faculty Survey 2009: Strategic Insights for Librarians, Publishers, and Societies, that looks at faculty attitudes towards the academic library, information resources, and the scholarly communications system. A few quotes from the report:

  • Faculty most often turn to network-level services, including both general purpose search engines and services targeted specifically to academia.
  • Of all disciplines, scientists remain the least likely to utilize library-specific starting points;
  • Network-level services are increasingly important for discovery, not only of monographs and journals but archival resources and other primary source collections.
  • The library must evolve to meet these changing needs.
  • 90% of faculty members view the library's buyer role as very important; 71% and 59% now view the archive and gateway roles as very important, respectively. Archiving is the second-highest-rated role.
  • Despite the reported declines in importance of all the library's roles other than as a buyer, the 2009 study saw a slight rise in perceived dependence on the library.
  • The declining visibility and importance of traditional roles for the library and the librarian may lead to faculty primarily perceiving the library as a budget line, rather than as an active intellectual partner.
  • Faculty members most strongly support and appreciate the library's infrastructural roles, in which it acquires and maintains collections of materials on their behalf.
  • Faculty members' sense of the significance of long-term preservation of electronic journals has steadily increased over time.
  • Effective and sustainable models for the preservation of electronic journals must be developed
  • Scholars, regardless of field, indicate a general preference that digital materials be preserved.
  • Less than 30% of faculty members have deposited any scholarly material into a repository; nearly 50% have not deposited but hope to do so in the future
  • Faculty attitudes and practices are at the strategic core. Greater engagement with and support of trailblazing faculty disciplines may help develop the roles and services to serve faculty needs into the future. The institutions that serve faculty must also anticipate them, both to ensure that the 21st century information needs of faculty are met and to secure their own relevance for the future.

Friday, April 09, 2010

Digital Preservation Matters - April 9, 2010

Blu-ray Disc Association Announces Additional Format Enhancements. Press Release. April 3, 2010.

The Blu-ray Disc Association announced two new media specifications:

  • The BDXL specification, targeted at broadcasting, medical and document imaging needs, has write-once discs of 100GB and 128GB capacity, and rewritable capability on 100GB discs. The discs use three to four recordable layers. A consumer version of BDXL is also expected sometime.
  • The Intra-Hybrid Blu-ray Disc combines a 25GB read-only layer and a 25GB rewritable (BD-RE) layer, so both needs can be met with one disc.

The two new types of discs require newly-designed hardware to record and play back.


Effort Will Help Libraries Put Academic Papers in Data 'Cloud'. Jeff Young. April 5, 2010.

Some librarians are hoping that cloud computing will help their efforts to build institutional repositories, university-wide collections of research papers. A new project sponsored by DuraSpace (a merger of DSpace and Fedora Commons) is called DuraCloud. This project plans to make it easier for librarians to put their repositories in off-site data storage. "A key design feature of DuraCloud is to leave the basics of pure storage to those who do it best (storage providers)." The project is now in the pilot phase, but should be available by the fall of 2010. "The biggest draw of the approach: It can be much cheaper than building new data centers to run on campuses."


Submission Policy Recommendations. Chris Prom. Practical E-Records. March 24, 2010.

Here are some great policy documents that are an essential first step toward creating an active digital preservation plan. There are links on this page to several documents:

  • E-Records Deposit Policy
  • Preservation/Access Plan
  • Transfer Guidelines
  • E-record Survey Form
  • Submission Agreement Form

There is also a link to the do-it-yourself TDR (Trusted Digital Repository). The preservation/access plan is especially helpful because it looks at supported formats (both access and preservation formats), access tools for the formats, and migration paths.


iPRES 2009: the Sixth International Conference on Preservation of Digital Objects. University of California. March 30, 2010.

The proceedings and videos from iPRES 2009 (held in San Francisco on Oct 5-6 2009) are now available online. The proceedings are available through the California Digital Library’s eScholarship site. The conference program, presentations, and videos are available at this link. There are many excellent resources here.


Friday, April 02, 2010

Digital Preservation Matters - April 2, 2010

Avoiding a Digital Dark Age. Kurt D. Bollacker. American Scientist. March-April 2010.

Data longevity depends on both the storage medium and the ability to decipher the information

The general problem of data preservation is twofold. The first matter is preservation of the data itself: The physical media on which data are written must be preserved, and this media must continue to accurately hold the data that are entrusted to it. This problem is the same for analog and digital media, but unless we are careful, digital media can be more fragile.

The second part of the equation is the comprehensibility of the data. Even if the storage medium survives perfectly, it will be of no use unless we can read and understand the data on it. Unlike in the analog world, digital data representations do not inherently degrade gracefully, because digital encoding methods represent data as a string of binary digits (“bits”). Because any single piece of digital media tends to have a relatively short lifetime, we will have to make copies far more often than has been historically required of analog media. Like species in nature, a copy of data that is more easily “reproduced” before it dies makes the data more likely to survive.

In order to survive, digital data must be understandable by both the machine reading them and the software interpreting them. There are at least two effective approaches: choosing data representation technologies wisely and creating mechanisms to reach backward in time from the future.


A Survey of the Scholarly Journals Using Open Journal Systems. Brian D. Edgar, John Willinsky. Educause Resources. March 4, 2010. [40 p. PDF]

Open Journal Systems (OJS) is an open source, online journal management and publishing platform. This study looks at scholarly communication using the open source software, based on a survey to which 998 editors or staff members responded. The results point to how these journals – largely independent, scholar-published titles with roughly half originating in the developing world – are not otherwise represented. Of the surveyed journals, 40 percent published research in the sciences, technology and medicine, 30 percent were social science journals, and 11 percent were in the humanities; 19 percent of the journals in the study were interdisciplinary.

The number of journals using OJS has been growing at an average rate of 81% per year, and of the new journals that are starting, 47% are using OJS. About half the journals using OJS are born digital. The study looks at the effect that open source tools can have on journal publishing, and adds to the case for rethinking scholarly communication.


Ensuring Perpetual Access: establishing a federated strategy on perpetual access and hosting of electronic resources for Germany. The Alliance of German Science Organisations. Final Report in English. March 30, 2010. [177p. PDF.]

Increasing digital content is a challenge for scientific institutions. This study is a basis for a national hosting strategy to “establish and finance sustainable structures for perpetual access as well as long-term preservation for electronic resources.” Research is critical to the economy. Large investments into the research need to be safeguarded and maintained. Any loss can impair research, and ensuring future access is an important challenge. One of the largest gaps is the “provision for perpetual access for e-journals.” Library access via hosting on publishers’ servers is not “sufficiently robust as a single perpetual access solution long-term,” though it may be the immediate approach. Independent perpetual access with partners is needed, such as Portico. There needs to be a “strategy to create an infrastructure for the storage and long-term preservation of digital documents, and which can guarantee perpetual access to licensed commercial publications and retro-digitised library materials.” PDF and XML with the NLM-DTD are becoming a metadata standard for published material.


Jhove2-0.6.0 Download. Website. March 19, 2010.

A new alpha release of JHOVE2 is now available for download and evaluation. Some features include:

  • Format identification, validation, feature extraction, and message digest calculation.
  • Recursive processing of directories, file sets, etc.
  • Integration with DROID for file identification.
  • Results formatted as text and XML.
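The flavor of the first two features can be sketched with a toy signature-based identifier. This is an illustration of the general approach, not JHOVE2's code; real tools consult a far larger signature registry, such as the one DROID uses.

```python
from pathlib import Path

# A handful of well-known file signatures ("magic numbers").
SIGNATURES = {
    b"%PDF": "PDF document",
    b"\x89PNG": "PNG image",
    b"PK\x03\x04": "ZIP container",
}

def identify(path):
    """Match the first bytes of a file against the signature table."""
    with open(path, "rb") as f:
        head = f.read(8)
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    return "unknown"

def identify_tree(root):
    """Recursively identify every file under root."""
    return {str(p): identify(p) for p in sorted(Path(root).rglob("*")) if p.is_file()}
```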


Friday, March 26, 2010

Digital Preservation Matters - March 26, 2010

Archiving Britain's web: The legal nightmare explored. Katie Scott. Wired. 05 March 2010.

Websites are increasingly recognized as culturally valuable, but there are concerns about the ability to preserve them because of current copyright requirements. Over the past six years, the British Library has archived over 6,000 culturally significant websites. Currently they must contact every copyright holder of these sites, and only have a 24% response rate. Some feel there is a "'digital black hole' in the nation's memory" because of the difficulty in archiving the websites. There is a proposal to change the law to allow the copy deposit act to include websites. Some are looking at an opt-out option. The BBC has a "no take-down" rule.


Canterbury Tales manuscript to be digitized. Medieval news. March 22, 2010.

The University of Manchester Library is planning to digitize the Canterbury Tales manuscript. This is part of a JISC funded project. The Centre of Digital Excellence supports universities, colleges, libraries and museums which lack the resources to digitize important works. In addition to the digitizing work, “they will also be exploring business models for the long term viability of digitisation.”


ISO Releases Archival Standards. eContent. Mar 23, 2010.

Two documents from the International Organization for Standardization (ISO) aim to provide guidelines for archiving patient information. "Health informatics-Security requirements for archiving of electronic health records-Principles" and "Health informatics-Security requirements for archiving of electronic health records-Guidelines" look at topics of records maintenance, retention, disclosure, and eventual destruction. Electronic medical data must be stored for the life of the patient; there are legal, ethical, and privacy concerns.


Elsevier and PANGAEA Data Archive Linking Agreement. Neil Beagrie. Blog. 03 Mar 2010.

Elsevier and the data library PANGAEA (Publishing Network for Geoscientific & Environmental Data) have agreed to reciprocal linking of their content in earth system research. Research data sets deposited at PANGAEA are now automatically linked to the corresponding articles in Elsevier journals on ScienceDirect. Science is better supported through the cooperation and the flow of data into trusted archives. “This is the beginning of a new way of managing, preserving and sharing data from earth system research.”


Duplicating Federal Videos for an Online Archive. Brian Stelter. The New York Times. March 14, 2010.

The International Amateur Scanning League plans to upload the National Archives' collection of 3,000 DVDs in an "experiment in crowd-sourced digitization" using a DVD duplicator and a YouTube account. This is a small demonstration that volunteers can sometimes achieve what bureaucracies can't or won't. Though the DVDs are all technically available to the public, they are hard to see unless a person visits the archive or pays for a copy. The volunteers duplicate the DVDs and then upload them to YouTube, the Internet Archive website, and an independent server.


Uncompressed Audio File Formats. JISC Digital Media. 10 February 2010.

This looks at the main features of uncompressed audio file types, including WAV, AIFF and Broadcast WAV (BWF). “Uncompressed audio files are the most accurate digital representation of a soundwave” but they also take the most resources. Digital audio recording measures the level of a sound wave at regular intervals and records that value as a number. “This bitstream is the ‘raw’ audio data, expressing the sound wave in its closest digital analogue.” These uncompressed audio file types are ‘wrapper’ formats that take the original data and combine it with additional data to make it compatible with other systems.

The most common is the Waveform Audio File Format (WAV), which is limited to a 4 Gb file size. The European Broadcasting Union created the Broadcast Wave Format (BWF), which is functionally identical to the WAV file except that it has an extra header chunk for metadata. This is a recommended archive format and also has a 4 Gb file size limit. The European Broadcasting Union has recently added the Multichannel Broadcast Wave Format (MBWF), which combines the RF64 audio format (surround sound, MP3, AAC, etc.) with a 64-bit address header and has a file size limit of 18 billion Gb. It is backward compatible with WAV and BWF. The Audio Interchange File Format (AIFF) is the native format for audio on Mac OS X.
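
To make the 4 Gb ceiling concrete, here is a minimal sketch (my own arithmetic, not from the JISC guide) of how long an archive-grade recording can run before a WAV/BWF file hits its 32-bit size field:

```python
# Uncompressed PCM size grows linearly with sample rate, bit depth,
# and channel count; the classic WAV/BWF header uses a 32-bit size
# field, which caps the file at roughly 4 Gb.

def pcm_bytes_per_second(sample_rate, bit_depth, channels):
    """Raw PCM data rate in bytes per second."""
    return sample_rate * (bit_depth // 8) * channels

def max_wav_duration_hours(sample_rate=48_000, bit_depth=24, channels=2):
    """Approximate longest recording that fits in a 4 Gb WAV/BWF file."""
    limit = 4 * 1024**3  # 32-bit size field: ~4 GiB
    return limit / pcm_bytes_per_second(sample_rate, bit_depth, channels) / 3600

# IASA-style archive audio (48 kHz / 24-bit stereo) hits the 4 Gb
# ceiling after roughly four hours.
print(round(max_wav_duration_hours(), 2))  # 4.14
```

At surround-sound channel counts the ceiling arrives proportionally sooner, which is why MBWF/RF64 moves to a 64-bit size field.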

“The International Association of Sound and Audiovisual Archives (IASA) recommend Broadcast WAV as a suitable archival format, for reasons of its wide compatibility and support, and its embedded metadata capability. For surround-sound or multichannel audio the MBWF format should be used. For archive PCM audio, bit depth should be a minimum of 24-bit, and sample rate a minimum of 48kHz to comply with IASA standards.” If compression is needed, lossless compression (which requires an additional encoding/decoding stage, or codec) is the least destructive alternative. Some open-source lossless compression codecs are available, such as FLAC.
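
FLAC achieves its savings with audio-specific prediction, but the defining property of any lossless codec — decoding restores the original bitstream exactly — can be demonstrated with Python's standard-library zlib as a stand-in:

```python
import array
import zlib

# Fake a short PCM-like signal: 16-bit samples of a repeating ramp.
samples = array.array('h', [(i * 37) % 2048 - 1024 for i in range(48_000)])
raw = samples.tobytes()

# Lossless compression: decoding recovers the original bitstream
# bit-for-bit, so no audio information is ever discarded.
compressed = zlib.compress(raw, level=9)
restored = zlib.decompress(compressed)

assert restored == raw  # bit-identical -> no audio degradation
print(len(raw), len(compressed))
```

A lossy codec such as MP3 would shrink the file further, but the equivalent round-trip would not reproduce the original samples, which is why it is unsuitable for archival masters.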


Court Orders Producing Party to "Unlock" PDF Since Not in a "Reasonably Usable" Form. Michael Arkfeld. Electronic Discovery and Evidence blog. February 15, 2010.

In this contractual action, the defendants disclosed an 11,757-page summary in a "locked" PDF format, precluding the plaintiff from editing or managing the summary without retyping it. The Court found that the defendants' locked format made it "completely impractical for use" and ordered the defendants to "unlock" the files.


Tuesday, March 16, 2010

Digital Preservation Matters - March 16 2010

Fending Off Digital Decay, Bit by Bit. Patricia Cohen. The New York Times. March 15, 2010.

This looks at the archival material, including digital, from an author that is on display at Emory University. It highlights what research libraries and archives are discovering: that “born-digital” materials are much more complicated and costly to preserve than anticipated. The “archivists are finding themselves trying to fend off digital extinction at the same time that they are puzzling through questions about what to save, how to save it and how to make that material accessible.” Computers have now been used for over two decades, but their digital materials are just now finding their way into archives. The curator said “We don’t really have any methodology as of yet to process born-digital material. We just store the disks in our climate-controlled stacks, and we’re hoping for some kind of universal Harvard guidelines.” The challenges include cataloging the material and acquiring the equipment and expertise to access data stored on obsolete media. Do they try to save the look and feel of the material or just save the content? Computer editing means that there are no manuscripts with pages full of “lots of crossings-out and scribbling”. The display adds “emulation to a born-digital archive”, similar to reproducing the author’s work environment. Emory is providing $500,000 to build a computer forensics lab to do this kind of work. Others are impressed with the emulation, but their focus is storage and preservation of digital content. One center is trying to raise money to hire a digital collections coordinator; until then, its digital materials are unavailable to researchers.


More on using DROID for Appraisal. Chris Prom. Practical E-Records. March 10, 2010.

The information that DROID supplies is useful, but the output is not optimally organized for reuse. By regularizing the DROID CSV output, the information became sortable and more useful. DROID was also useful in identifying files that did not use the standard file extension for an application, and in finding files that needed attention or conversion. It was very useful in the appraisal process: with it, the major migration problems could be identified, and it helped to weed out inappropriate, duplicate, or private content.
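
The workflow Prom describes — regularize the CSV, sort it, flag extension/format mismatches — can be sketched with the standard library. The column names below (NAME, EXT, PUID, FORMAT_NAME) are illustrative; real DROID exports vary by version:

```python
import csv
import io
from collections import Counter

# A tiny stand-in for a DROID CSV export; real column names and
# values vary by DROID version, so treat this layout as hypothetical.
droid_csv = """NAME,EXT,PUID,FORMAT_NAME
report.doc,doc,fmt/40,Microsoft Word Document
notes.txt,txt,x-fmt/111,Plain Text File
image.doc,doc,fmt/43,JPEG File Interchange Format
data.csv,csv,x-fmt/18,Comma Separated Values
"""

rows = list(csv.DictReader(io.StringIO(droid_csv)))

# Regularize: sort by identified format so the list is easy to appraise.
rows.sort(key=lambda r: r["FORMAT_NAME"])

# Flag files whose extension disagrees with the identified format --
# e.g. a JPEG masquerading as .doc needs attention before migration.
suspects = [r["NAME"] for r in rows
            if r["EXT"] == "doc" and "Word" not in r["FORMAT_NAME"]]
print(suspects)  # ['image.doc']

# Count formats to see where the major migration effort will go.
print(Counter(r["FORMAT_NAME"] for r in rows))
```

A fuller version would map every extension to its expected PUIDs rather than hard-coding one case, but the appraisal idea is the same.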


Data, data everywhere. Economist. February 25, 2010.

The world contains an unimaginably vast amount of digital information which is increasing rapidly. This makes it possible to do many things that previously could not be done but it is also creating a host of new problems. The proliferation of data is making them increasingly inaccessible. The way that information is managed touches all areas of life. The data-centered economy is still new and the implications are not yet understood.


Archon™: The Simple Archival Information System. Website. 15 February 2010.

Version 3 of this software has been released. The software is for archivists and manuscript curators. It publishes archival descriptive information and digital archival objects to a user-friendly website. Functionality includes:

· Create standards-compliant collection descriptions and full finding aids using web forms.

· Describe the series, subseries, files, items, etc. within each collection.

· Upload digital objects/electronic records or link archival descriptions to external URLs.

· Batch import data.

· Export MARC and EAD records.


Deluge of scientific data needs to be curated for long-term use. Carole L. Palmer. February 24, 2010.

Data curation is the active and ongoing management of data through their lifecycle. It is an important part of research. Data are a valuable asset to institutions and to the scientific enterprise. Saving the publications that report the results of research isn't enough; researchers also need access to the data. Data curation begins long before the data are generated; it needs to start at the proposal stage. Without the data there is no way to replicate and validate a research project's conclusions. "Digital content, including digital data, is much more vulnerable than the print or analog formats we had before." Selecting, appraising and organizing data to make them accessible and interpretable takes a lot of work and expense. "The bottom line is that many very talented scientists are spending a lot of time and effort managing data. Our aim is to get scientists back to doing science, where their expertise can make a real difference to society."


Is copyright getting in the way of us preserving our history? Victor Keegan. The Guardian. 25 February 2010.

In theory, future historians will have a lot of information about our age. In reality, much of it may be lost. Much of the information is on web pages, and they have a short life expectancy. The British Library has launched the UK Web Archive, which will guarantee longevity to thousands of hand-picked UK websites. But this is only a small part. “The issue of copyright is a global nightmare for anyone interested in digital preservation.”


"Zubulake Revisited: Six Years Later": Judge Shira Scheindlin Issues her Latest e-Discovery Opinion. Electronic Discovery Law. January 27, 2010.

This reviews a case that addresses parties’ preservation obligations. Check here for the full opinion. The case revisits an earlier decision concerning e-discovery (finding electronic documents, emails, etc., in court cases), preservation obligations, and negligence for failure to keep records correctly. Some statements from the court opinion:

  • By now, it should be abundantly clear that the duty to preserve means what it says and that a failure to preserve records, paper or electronic, and to search in the right places for those records, will inevitably result in the spoliation of evidence.
  • While litigants are not required to execute document productions with absolute precision, at a minimum they must act diligently and search thoroughly at the time they reasonably anticipate litigation.
  • The following failures support a finding of gross negligence, when the duty to preserve has attached: to issue a written litigation hold; to identify all of the key players and to ensure that their electronic and paper records are preserved; to cease the deletion of email or to preserve the records of former employees that are in a party's possession, custody, or control; and to preserve backup tapes when they are the sole source of relevant information or when they relate to key players, if the relevant information maintained by those players is not obtainable from readily accessible sources.
  • The case law makes crystal clear that the breach of the duty to preserve, and the resulting spoliation of evidence, may result in the imposition of sanctions by a court because the court has the obligation to ensure that the judicial process is not abused.

Friday, March 12, 2010

A New Approach to Web Archiving

At the Marriott Library, we’ve recently begun looking into what it would take to archive websites that are important to the University. During some research into this area, I came across the proceedings of the 2009 International Web Archiving Workshop (IWAW).

An interesting project is taking place in France that may change the way web archiving is approached. At University P. and M. Curie in Paris, researchers are developing a web crawler that will not only detect changes to a website but one that will be able to detect which changes are unimportant (changing ads on a page, etc.) versus which are important to the page’s content. If successful, this might greatly improve the effectiveness of the web archiving system because digital archives would no longer be gumming up bandwidth and storage space with needless data.

This project is taking place in conjunction with the French National Audio-Visual Institute (INA). The institute would like to archive French television and radio station websites. The visual component of the institute’s pages is very important to the project, not just the content.

According to the workshop proceedings, the project idea is to “use a visual page analysis to assign importance to web pages parts, according to their relative location. In other words, page versions are restructured according to their visual representation. Detecting changes on such restructured page versions gives relevant information for understanding the dynamics of the web sites. A web page can be partitioned into multiple segments or blocks and, often, the blocks in a page have a different importance. In fact, different regions inside a web page have different importance weights according to their location, area size, content, etc. Typically, the most important information is on the center of a page, advertisement is on the header or on the left side and copyright is on the footer. Once the page is segmented, then a relative importance must be assigned to each block…Comparing two pages based on their visual representation is semantically more informative than with their HTML representation.”

The main concept and hopeful contribution to the world of web archiving is summed up by the presenters as follows:

• A novel web archiving approach that combines three concepts: visual page analysis (or segmentation), visual change detection and importance of web page’s blocks.

• An extension of an existing visual segmentation model to describe the whole visual aspect of the web page.

• An adequate change detection algorithm that computes changes between visual layout structures of web pages with a reasonable complexity in time.

• A method to evaluate the importance of changes occurred between consecutive versions of documents.

• An implementation of our approach and some experiments to demonstrate its feasibility.
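
The importance-weighted change detection described above can be reduced to a toy model: each page version is a set of visual blocks, and each block carries a weight reflecting its location. The block names and weights below are illustrative, not taken from the IWAW paper:

```python
# Toy importance-weighted change detection: a page version is a dict
# of block -> content, and each block has a visual-importance weight
# (center content outweighs header, sidebar ads, and footer).
WEIGHTS = {"center": 0.7, "header": 0.15, "sidebar": 0.1, "footer": 0.05}

def change_score(old, new):
    """Weighted fraction of the page whose blocks changed."""
    score = 0.0
    for block, weight in WEIGHTS.items():
        if old.get(block) != new.get(block):
            score += weight
    return score

v1 = {"center": "article text", "header": "logo",
      "sidebar": "ad #1", "footer": "(c) 2010"}
v2 = dict(v1, sidebar="ad #2")           # only the ad rotated
v3 = dict(v1, center="breaking update")  # the main content changed

# Low score: not worth archiving a new version of the page.
print(change_score(v1, v2))  # 0.1
# High score: capture this version.
print(change_score(v1, v3))  # 0.7
```

An archiver would recrawl only when the score crosses a threshold, which is exactly how unimportant ad churn stops "gumming up bandwidth and storage space."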

It will be interesting to follow this project as it reaches its conclusion and see how its results will affect current web archiving players, as well as fellow research endeavors like the Memento Project.

You can read about this project in much more technical detail at the IWAW website (unless it’s been taken down and hasn’t been properly archived).

Thursday, March 11, 2010

Digital Preservation Matters - March 9, 2010

Accelerated Life Cycle Comparison of Millenniata Archival DVD [corrected link]. Ivan Svrcek. Naval Air Warfare Center. March 2010. [75 p. PDF]

The Life Cycle and Environmental Engineering branch at the China Lake installation performed an accelerated aging test comparing Millenniata discs with current archival-grade DVDs (Delkin, MAM-A, Mitsubishi, Taiyo Yuden, and Verbatim). The test evaluated disc stability when exposed to combined light, heat and humidity. Besides using the standard tests for predicting the lifetime of a disc, the test included looking at initial write quality and exposure to the full spectrum of light. The test also looked at the drives used to burn the discs, and which drives worked best with which discs. One conclusion about the drives was that “the device used to record an optical media can have a great impact upon the write quality and should be considered in all data storage situations.” According to the ECMA standards, “All dye-based discs failed.” That is in contrast to the Millenniata discs: “none of the Millenniata media suffered any data degradation at all. Every other brand tested showed large increases in data errors after the stress period. Many of the discs were so damaged that they could not be recognized as DVDs by the disc analyzer.”

“Ensuring that valuable digital assets will be available for future use is not simply a matter of finding sufficient funds. It is about mobilizing resources—human, technical, and financial—across a spectrum of stakeholders.” Major questions are what should we preserve, who is responsible, and who will pay for it. This looks at scholarly publications, research data, commercially owned culture content, and collectively produced web content. Three important components in developing preservation strategies :
  1. When talking about preservation, make the case for use of the materials. A decision to preserve something now does not mean a permanent commitment of resources. The value and use may be clearer later.
  2. Incentives to preserve must be clearly shown as being in the public interest.
  3. There must be agreement on the roles and responsibilities of all concerned: the information creators, owners, preservers, and users.
It is important to reduce the cost of preservation as digital information increases. The areas for priority action include:
Organizational: develop partnerships; ensure access to skilled personnel; sustain the stewardship chain.
Technical: build capacity to support stewardship in all areas; lower the cost of preservation overall.
Policy: create incentives; clarify rights of web materials; empower organizations.
Educational: promote education and training; raise awareness of the urgency of timely preservation actions.
  • Sustainable preservation strategies are not built all at once, nor are they static. Sustainable preservation is a series of timely actions taken to anticipate the dynamic nature of digital information.
  • Commitments made today are not commitments for all time. But actions must be taken today to ensure flexibility in the future.
  • Sustainable digital preservation requires a compelling value proposition, incentives to act, and well-defined roles and responsibilities.
  • Decisions about longevity are made throughout the digital lifecycle.
  • A sustainable preservation strategy must be flexible enough to span generations of data formats, access platforms, owners, and users.
  • Preservation decisions can often be seen as an incremental cost, and are often the same as decisions made to meet current demand.
Five conditions required for economic sustainability are:
  1. recognition of the benefits of preservation by decision makers;
  2. a process for selecting digital materials with long-term value;
  3. incentives for decision makers to preserve in the public interest;
  4. appropriate organization and governance of digital preservation activities; and
  5. mechanisms to secure an ongoing, efficient allocation of resources to digital preservation activities.


AAC Audio and the MP4 Media Format. JISC Digital Media. 12 February 2010.

From the JISC advice site: this is a guide to creating and using AAC compressed audio resources. AAC is the successor to the MP3 format; the site explains the advantages of AAC over MP3. AAC offers a significant reduction in audio file size while still retaining good sound quality. The AAC audio standard is a subsection of the MPEG-4 standard, and the MP4 file type is often used to deliver the content. Apple added the .m4a and .m4p extensions to designate audio content. AAC requires a compatible codec for the final user to be able to listen to it. AAC uses lossy compression, so for standards-compliant sound archiving, the Broadcast WAV format should be used, according to the guidelines of the International Association of Sound and Audiovisual Archives (IASA). More on BWAV at the BBC site.
If you don’t need standards compliance or absolute fidelity for your archive, or if you don’t have the storage space for the much larger uncompressed BWAV files, “then you may want to consider AAC as the overall best currently available lossy compression method.” This is an excellent site for information and contains much more on the container, encoding, versions, filetypes, bitrate, metadata, the iTunes schema, and a simplified visual representation of an MP4.
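
The MP4 "container" mentioned above is built from simple length-prefixed boxes. The sketch below (illustrative values, not a full parser) builds and reads the 'ftyp' box that opens an .m4a file and declares its brand:

```python
import struct

# ISO base media (MP4) files are sequences of "boxes": a 32-bit
# big-endian length, a 4-byte type code, then the payload. An .m4a
# file opens with an 'ftyp' box whose major brand is 'M4A '.

def make_box(box_type, payload):
    """Serialize one MP4 box (length + type + payload)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload

def parse_box(data, offset=0):
    """Read the box at `offset`; return its type code and payload."""
    size, box_type = struct.unpack_from(">I4s", data, offset)
    payload = data[offset + 8 : offset + size]
    return box_type, payload

# Build a hypothetical ftyp box: major brand, minor version,
# then a list of compatible brands.
ftyp_payload = b"M4A " + struct.pack(">I", 0) + b"isomiso2"
data = make_box(b"ftyp", ftyp_payload)

box_type, payload = parse_box(data)
print(box_type, payload[:4])  # b'ftyp' b'M4A '
```

Real files nest further boxes (moov, mdat, and the metadata atoms the iTunes schema uses), but they all follow this same length-type-payload layout, which is what lets one container wrap AAC audio, video, and metadata together.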