<?xml version="1.0" encoding="UTF-8" standalone="yes"?><oembed><version><![CDATA[1.0]]></version><provider_name><![CDATA[The Multimedia Commons Initiative]]></provider_name><provider_url><![CDATA[https://multimediacommons.wordpress.com]]></provider_url><author_name><![CDATA[jychoi84]]></author_name><author_url><![CDATA[https://multimediacommons.wordpress.com/author/jychoi84/]]></author_url><title><![CDATA[Getting Started]]></title><type><![CDATA[link]]></type><html><![CDATA[
<p>The Multimedia Commons data&nbsp;and related resources are stored on Amazon <a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer">S3</a> (Simple Storage Service), in the <code>multimedia-commons</code>&nbsp;data bucket. This page explains how you can use Amazon Web Services and other tools to access the data, download it, or work with it&nbsp;directly in the cloud.</p>



<figure class="wp-block-table"><table><tbody><tr><td><strong>Jump to:</strong><br><a href="#portal">Browsing the Data</a><br><a href="#download">Downloading the Data</a><br><a href="#copy">Copying the Data</a><br><a href="#mount">Mounting the Data</a><br><a href="#cft">Using our&nbsp;CloudFormation Template</a><br><a href="#ebs">Attaching an EBS Volume to an EC2 Instance</a><br><a href="#solr">Setting Up Solr to Search the Data</a></td><td><a href="https://multimediacommons.files.wordpress.com/2015/12/flickr_sample_20.png" rel="attachment wp-att-363"><img data-attachment-id="363" data-permalink="https://multimediacommons.wordpress.com/yfcc100m-core-dataset/flickr_sample_20/" data-orig-file="https://multimediacommons.files.wordpress.com/2015/12/flickr_sample_20.png" data-orig-size="1040,697" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="Flickr_Sample_20" data-image-description="" data-image-caption="" data-medium-file="https://multimediacommons.files.wordpress.com/2015/12/flickr_sample_20.png?w=300" data-large-file="https://multimediacommons.files.wordpress.com/2015/12/flickr_sample_20.png?w=1024" class="wp-image-363" src="https://multimediacommons.files.wordpress.com/2015/12/flickr_sample_20.png?w=250" alt="Sample image: Arrow made of multicolored leaves" width="250" srcset="https://multimediacommons.files.wordpress.com/2015/12/flickr_sample_20.png?w=250 250w, https://multimediacommons.files.wordpress.com/2015/12/flickr_sample_20.png?w=500 500w, https://multimediacommons.files.wordpress.com/2015/12/flickr_sample_20.png?w=150 150w, 
https://multimediacommons.files.wordpress.com/2015/12/flickr_sample_20.png?w=300 300w" sizes="(max-width: 250px) 100vw, 250px"></a></td></tr></tbody></table></figure>



<h2><a name="portal"></a><strong>Browsing&nbsp;the Data</strong></h2>



<p>You can&nbsp;explore&nbsp;the Multimedia Commons data through&nbsp;our <a href="http://multimedia-commons.s3-website-us-west-2.amazonaws.com/" target="_blank" rel="noopener noreferrer">data portal</a>.</p>



<hr class="wp-block-separator" />



<h2><a name="download"></a><strong>Downloading the Data</strong></h2>



<p>You can download the whole dataset or a portion of it to your local hard drive or to an AWS EBS volume attached to an AWS EC2 instance. If you use an EBS volume, provision one large enough to hold the data you copy.</p>



<p>If you just want to download individual files, you can also use our&nbsp;data portal to navigate to individual items and then save them directly to your computer. However, if you want to download larger batches of files, you probably will&nbsp;want to use specialized utilities. There are many tools out there that can be used for this, so try them out and use whichever you feel comfortable with. For example:</p>



<ul><li><a rel="noopener noreferrer" href="https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html" target="_blank">aws CLI</a> is a tool that provides an interface to all parts of AWS, including S3. Example aws CLI commands:<br><code>aws s3 ls s3://multimedia-commons/data/images/<br> aws s3 sync s3://multimedia-commons/data/images/ . </code></li><li><a rel="noopener noreferrer" href="http://s3tools.org/s3cmd" target="_blank">s3cmd</a> is useful for simple operations such as listing contents, downloading files, etc. Example s3cmd commands:<br><code>s3cmd ls s3://multimedia-commons/data/images/</code> <pre>s3cmd get --recursive s3://multimedia-commons/data/images/ local-dir</pre> </li></ul>
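


<p>Before starting a large download, it can help to check how much data a prefix holds and to preview what a sync would copy. The following is a minimal sketch using the aws CLI; the helper names <code>mmc_prefix_size</code> and <code>mmc_sync_preview</code> are illustrative assumptions, not part of our tooling. Listing a very large prefix can itself take a long time.</p>



<pre>
# List every object under a prefix of the multimedia-commons bucket,
# followed by the total object count and total size.
mmc_prefix_size() {
  aws s3 ls "s3://multimedia-commons/$1" --recursive --human-readable --summarize
}

# Show what a sync would download, without copying anything.
mmc_sync_preview() {
  aws s3 sync "s3://multimedia-commons/$1" "$2" --dryrun
}
</pre>



<p>For example, <code>mmc_prefix_size data/images/</code> reports the size of the images, and <code>mmc_sync_preview data/images/ ./images</code> previews a download into a local <code>images</code> directory.</p>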



<hr class="wp-block-separator" />



<h2><a name="copy"></a><strong>Copying the Data</strong></h2>



<p>If you&#8217;re already using S3, you can also copy the data to your own S3 bucket.</p>



<p><strong>Step 1:</strong> To copy directly from a source bucket (we&#8217;ll call it <code>src_s3_bucket</code>) to a&nbsp;destination bucket (we&#8217;ll call it <code>dst_s3_bucket</code>), use the <code>cp</code> command of Amazon&#8217;s own <a href="https://aws.amazon.com/cli/" target="_blank" rel="noopener noreferrer">aws</a>&nbsp;command line utility:</p>



<ul><li style="text-align:justify;"><code> aws s3 cp s3://src_s3_bucket/ s3://dst_s3_bucket/ --recursive</code></li></ul>



<p>The <code>--recursive</code> flag specifies that ALL files must be copied, with the same directory structure as the original.</p>



<p>For example, to copy every MFCC20 audio feature file to your bucket, using the same directory structure, use something like this:</p>



<ul><li style="text-align:justify;"><code> aws s3 cp s3://multimedia-commons-2019migration/feats/audio/mfcc20/ s3://dst_s3_bucket/feats/audio/mfcc20 --recursive</code></li></ul>



<p><strong>Step 2:</strong> You can check the contents of the destination bucket using one of the following commands:</p>



<ul><li><code>aws s3 ls s3://dst_s3_bucket --recursive</code> <i>(lists all items, which may be a lot!)</i></li><li><code>aws s3 ls s3://dst_s3_bucket --recursive | wc -l</code> <i>(returns the number of files)</i></li></ul>



<hr class="wp-block-separator" />



<h2><a name="mount"></a><strong>Mounting&nbsp;the Data</strong></h2>



<p>If you have an account with Amazon&#8217;s <a href="https://aws.amazon.com/ec2/" target="_blank" rel="noopener noreferrer">EC2</a> (Elastic Compute Cloud) service, you can launch an EC2 instance and mount our&nbsp;S3 bucket that&nbsp;contains the&nbsp;Multimedia Commons data. Then you can work in your&nbsp;EC2 instance without having to download the data at all (and thus without incurring the associated storage costs).&nbsp;The instructions below work whether you are mounting the <code>s3://multimedia-commons-2019migration</code> bucket or mounting your own bucket, in case you made a&nbsp;copy of our&nbsp;data.&nbsp;These instructions assume you already have an EC2 instance running. See <a href="#cft">the next section</a> for instructions on using the Multimedia Commons CloudFormation Template to launch an EC2 instance with some of our tools pre-installed, or see the&nbsp;<a href="https://aws.amazon.com/ec2/getting-started/" target="_blank" rel="noopener noreferrer">AWS EC2 documentation</a> for more general instructions.</p>



<p><strong>Step 1:</strong> If you just launched a new EC2 instance, update the system first.</p>



<ul><li>For Amazon Linux, CentOS or Red Hat:<br><code>sudo yum update</code></li><li>For Ubuntu or Debian:<br><code>sudo apt-get update</code></li></ul>



<p><strong>Step 2:</strong> Install needed&nbsp;dependencies.</p>



<ul><li>In Amazon Linux, CentOS or Red Hat:<br><code>sudo yum install gcc libstdc++-devel gcc-c++ fuse fuse-devel curl-devel libxml2-devel openssl-devel mailcap</code></li><li>In Ubuntu or Debian:<br><code>sudo apt-get install build-essential gcc libfuse-dev fuse-utils libcurl4-openssl-dev libxml2-dev mime-support</code></li></ul>



<p><strong>Step 3:</strong> Download the latest&nbsp;<a href="http://code.google.com/p/s3fs/downloads/list" target="_blank" rel="noopener noreferrer">s3fs</a>&nbsp;package to a local directory on your EC2 instance, untar it by executing&nbsp;<code>tar -xvzf s3fs-1.74.tar.gz</code>, enter&nbsp;the extracted s3fs directory by typing&nbsp;<code>cd s3fs-1.74</code>, and finally compile it:</p>



<ul><li style="text-align:justify;"><code>./configure --prefix=/usr</code></li><li style="text-align:justify;"><code>make</code></li><li style="text-align:justify;"><code>sudo make install</code></li></ul>



<p><strong>Step 4:</strong>&nbsp;Create&nbsp;an access key and secret key from the <a href="https://console.aws.amazon.com/iam/home#security_credential" target="_blank" rel="noopener noreferrer">AWS console</a>, if you haven&#8217;t done so yet.&nbsp;The Security Credentials page shows your access key and your secret key (click <em>Show</em> to make the secret key visible).</p>



<p><strong>Step 5:</strong>&nbsp;Save the access key and secret key to&nbsp;your EC2 instance.</p>



<ul><li>Create a new file in your <code>/etc</code> directory with the name <code>passwd-s3fs</code>, e.g. using the text editor vim.</li><li>Copy the access key and then the secret key from the Security Credentials page and paste them into the file with a colon in between them (no space):&nbsp;<code>accesskey<strong>:</strong>secretkey</code>.&nbsp;Do not hit enter on your keyboard&nbsp;after adding the keys.</li><li>Save the file and exit.</li><li>Update the permissions on your password file:&nbsp;<code>chmod 640 /etc/passwd-s3fs</code>.</li></ul>
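


<p>If you prefer to script this step, the sketch below writes the same <code>accesskey:secretkey</code> line and sets the same permissions. The function name <code>write_s3fs_passwd</code> is a hypothetical helper; <code>printf</code> is used because it does not append a trailing newline.</p>



<pre>
# Write ACCESSKEY:SECRETKEY to a credentials file with no trailing newline,
# then restrict its permissions as in the last step above.
write_s3fs_passwd() {
  passwd_file="$1"
  access_key="$2"
  secret_key="$3"
  printf '%s:%s' "$access_key" "$secret_key" &gt; "$passwd_file"
  chmod 640 "$passwd_file"
}
</pre>



<p>Writing to <code>/etc/passwd-s3fs</code> usually requires root privileges, so you may need to call the function from a root shell. Keep in mind that keys typed on the command line can end up in your shell history.</p>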



<p><strong>Step 6:</strong> Create a directory in which to mount the S3 bucket, for instance in your home directory.</p>



<ul><li style="text-align:justify;"><code>mkdir ~/my_s3_bucket</code></li></ul>



<p><strong>Step 7:</strong> Mount the S3 bucket and make its contents accessible within the directory you just created.</p>



<ul><li style="text-align:justify;"><code>s3fs BUCKETNAME ~/my_s3_bucket</code></li></ul>



<p>Replace <code>BUCKETNAME</code>&nbsp;with the name of the bucket you want to&nbsp;mount, e.g.&nbsp;<code>multimedia-commons-2019migration</code>&nbsp;for our bucket. If you get an error that the s3fs utility&nbsp;cannot be found, call it by its complete path.&nbsp;You can find&nbsp;out where the s3fs utility is&nbsp;stored on your file system by executing&nbsp;the command&nbsp;<code>which s3fs</code>.</p>



<p><strong>Step 8:</strong> Check that the S3 bucket was successfully mounted.</p>



<ul><li style="text-align:justify;"><code>df -Th ~/my_s3_bucket &nbsp;</code></li></ul>



<p>Note that <code>s3fs</code> always reports 256TB as the size of the disk.&nbsp;If you are trying to access a bucket you created yourself that contains data and this data is not visible, then you need to adjust&nbsp;the permissions in the access control list (ACL) for the bucket so that it can be read. You can do this using the AWS Management Console; see&nbsp;<a href="http://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html" target="_blank" rel="noopener noreferrer">here</a>&nbsp;for more&nbsp;information.</p>
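


<p>The mount and check commands above can be wrapped into small helpers, sketched below. The function names are hypothetical. The sketch assumes the credentials file is at <code>/etc/passwd-s3fs</code> and passes it explicitly with the s3fs <code>passwd_file</code> option; it also shows how to unmount the bucket with <code>fusermount</code> when you are done.</p>



<pre>
# Mount a bucket with an explicit credentials file, then show the mount.
mount_s3_bucket() {
  mkdir -p "$2"
  s3fs "$1" "$2" -o passwd_file=/etc/passwd-s3fs
  df -Th "$2"
}

# Detach the bucket again when you are done.
unmount_s3_bucket() {
  fusermount -u "$1"
}
</pre>



<p>For example, <code>mount_s3_bucket BUCKETNAME ~/my_s3_bucket</code> mounts the bucket and <code>unmount_s3_bucket ~/my_s3_bucket</code> releases it.</p>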



<hr class="wp-block-separator" />



<h2><a name="cft"></a><strong>Using our&nbsp;CloudFormation Template to Launch EC2</strong></h2>



<p>AWS CloudFormation Templates provide a simplified way of launching an EC2 instance with an environment already set up for a particular task &#8212; in this case, working with the Multimedia Commons dataset. Here, we explain how to launch an EC2 instance with the Multimedia Commons template we prepared. Our template also pre-installs&nbsp;the audioCaffe analysis tool for you.</p>



<p><strong>Step 1:</strong>&nbsp;Download <a href="https://s3-us-west-2.amazonaws.com/multimedia-commons/tools/etc/MultimediaCommons-audioCaffe-v0.2.template" target="_blank" rel="noopener noreferrer">the Multimedia Commons CloudFormation Template</a>.</p>



<p><strong>Step 2:</strong> In your AWS Management Console, click <em>CloudFormation</em>.</p>



<p><strong>Step 3:</strong> Click <em>Create New Stack</em>.</p>



<p><strong>Step 4:</strong> Choose <em>Upload a template to Amazon S3</em> and upload the Multimedia Commons template you downloaded in Step 1.</p>



<p><strong>Step 5:</strong> Enter the name of the stack in the &#8220;Stack Name&#8221; field.</p>



<p><strong>Step 6:</strong> In the &#8220;Parameters&#8221; section, enter the name of the key pair for the EC2 instance in the &#8220;KeyName&#8221; field. You can <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair" target="_blank" rel="noopener noreferrer">create a new key pair</a>&nbsp;or use one you&#8217;ve already created.</p>



<p><strong>Step 7:</strong>&nbsp;Choose an EC2 instance type using the &#8220;audioCaffeInstanceType&#8221; field, according to the level of computing power you need. A list of instance types can be found on the <a href="https://aws.amazon.com/ec2/instance-types/" target="_blank" rel="noopener noreferrer">Amazon EC2 Instances</a> page. Click <em>Next</em>.</p>



<p><strong>Step 8:</strong> On the <em>Options</em> page, you can specify key-value pairs for describing the stack. Click <em>Next.</em></p>



<p><strong>Step 9:</strong> Review the settings and click <em>Create</em>. Wait for the status of your stack to change to <em>CREATE_COMPLETE</em>.</p>



<p><strong>Step 10:</strong> Navigate to the EC2 console. You will see that two EC2 instances, &#8220;audioCaffe Server&#8221; and &#8220;NAT (audioCaffe VPC)&#8221;, have been created as part of the new stack. Once the State of the &#8220;audioCaffe Server&#8221; instance is &#8220;running&#8221;, you can connect to it using the key pair you supplied in Step 6.</p>



<p><code>ssh -i KEYPAIRNAME.pem ubuntu@IPADDRESS</code></p>



<p>(Replace <code>KEYPAIRNAME</code> with the key pair name you entered in Step 6 and <code>IPADDRESS</code> with the IP address of the audioCaffe server.)</p>



<p>Once connected, you will find that <em>audioCaffe</em> is already installed under the home directory. See&nbsp;<a href="https://multimediacommons.wordpress.com/featured-tools/#audiocaffe" target="_blank" rel="noopener noreferrer">the audioCaffe description</a>&nbsp;for more details.</p>



<p><a href="https://aws.amazon.com/cloudformation/aws-cloudformation-templates/" target="_blank" rel="noopener noreferrer">Click here for additional CloudFormation documentation from AWS.</a></p>



<hr class="wp-block-separator" />



<h2><a name="ebs"></a><strong><b>Attaching an EBS Volume to an EC2 Instance</b></strong></h2>



<p><strong>Step 1:</strong> In the EC2 console, click <em>Volumes</em> under &#8220;Elastic Block Store&#8221; (in the left-hand menu).</p>



<p><strong>Step 2:</strong> Click <em>Create Volume</em>. In the pop-up window, change the &#8220;Size&#8221; of the volume (in GiB). Change the values of other fields if needed. Click <em>Create</em>.</p>



<p><strong>Step 3:</strong> Click on the volume you created, then in the <em>Actions</em> drop-down menu, click <em>Attach Volume</em>.</p>



<p><strong>Step 4:</strong> In the &#8220;Instance&#8221; field, you can choose from a list of EC2 instances. For example, you can click <em>audioCaffe Server</em> if you want to attach the volume to your EC2 instance created using the Multimedia Commons CloudFormation Template.</p>



<p><strong>Step 5:</strong> Log into the EC2 instance, then mount the EBS volume according to <a href="http://maplpro.blogspot.com.au/2012/05/how-to-mount-ebs-volume-into-ec2-ubuntu.html" target="_blank" rel="noopener noreferrer">these instructions</a>.</p>
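


<p>As a rough sketch of what mounting a new volume typically involves on a Linux instance: create a file system on the volume, create a mount point, and mount it. The helper name, the device name <code>/dev/xvdf</code> and the mount point <code>/data</code> are assumptions for illustration; run <code>lsblk</code> first to find the actual device name on your instance.</p>



<pre>
# Format a newly attached, EMPTY volume and mount it.
# WARNING: mkfs erases any data already on the device.
mount_new_ebs_volume() {
  device="$1"       # e.g. /dev/xvdf (assumption; check with lsblk)
  mount_point="$2"  # e.g. /data (hypothetical)
  sudo mkfs -t ext4 "$device" || return 1
  sudo mkdir -p "$mount_point" || return 1
  sudo mount "$device" "$mount_point"
}
</pre>



<p>For example: <code>mount_new_ebs_volume /dev/xvdf /data</code>. Skip the <code>mkfs</code> line if the volume already contains a file system with data you want to keep.</p>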



<hr class="wp-block-separator" />



<h2><a name="solr"></a><strong><b>Setting Up Solr to Search the Data</b></strong></h2>



<p>Because the dataset is so large, it is easiest to browse its contents using a relational database or a search platform: MySQL or PostgreSQL for basic SQL-style querying, or Apache Solr if you want more powerful search capabilities. A public alternative is the <a href="http://search.mmcommons.org/">Multimedia Commons Search</a>, which is still under development.</p>



<p>You can either download a search index or set up your own search. First, we describe how to set up Solr with a downloaded search index.</p>



<p><strong>Step 1:</strong> Download Solr from the <a href="http://lucene.apache.org/solr/resources.html" target="_blank" rel="noopener noreferrer">Solr Resources page</a>. (This page also has extensive information and tutorials about how to use Solr in general.)</p>



<p><strong>Step 2:</strong> Use <a href="https://rclone.org/drive/">rclone</a> to download <a href="https://drive.google.com/open?id=0B6D7jCorgVvCLXdkLTNoN0NnS0E">the precompiled search index</a>. A normal download is not possible because one of the files is larger than 35GB.</p>



<p><strong>Step 3:</strong> Move the index to <code>PATH/server/solr</code>. (Replace <code>PATH</code> with the path of the directory where you downloaded Solr.)</p>



<p><strong>Step 4:</strong> Start the server.</p>



<ul><li><code>PATH &gt; ./bin/solr start </code></li></ul>



<p>Because the search index is large, memory issues might occur: by default, the JVM is started with very little memory. In this case, increasing the memory might help (<code>PATH &gt; ./bin/solr restart -m 12g</code>). Run <code>PATH &gt; ./bin/solr restart --help</code> to see further start options. The Solr dashboard at <a href="http://localhost:8983/solr/#/">http://localhost:8983/solr/#/</a> shows memory consumption and other properties at run time.</p>



<p><strong>Step 5:</strong> Open your web browser and connect to the web console:</p>



<ul><li><code>localhost:8983/solr</code>.</li></ul>



<p><strong>Step 6:</strong> Happy Searching!</p>



<p>Next, we explain how to set up your own search index with Apache Solr.</p>



<p><strong>Step 1:</strong> Download Solr from the <a href="http://lucene.apache.org/solr/resources.html" target="_blank" rel="noopener noreferrer">Solr Resources page</a>. (This page also has extensive information and tutorials about how to use Solr in general.)</p>



<p><strong>Step 2:</strong> Under <code>PATH/server/solr</code>, create a directory for the new core. (Replace <code>PATH</code> with the path of the directory where you downloaded Solr.) For example, you might create a new directory called <code>yfcc100m</code>.</p>



<p><strong>Step 3:</strong> Download the <a href="https://github.com/multimedia-berkeley/mmc_search_solr" target="_blank" rel="noopener noreferrer">MMC search sample schema and config file</a> and put it in the new core directory.</p>



<p><strong>Step 4:</strong> Start the server.</p>



<ul><li style="text-align:justify;"><code>PATH &gt; ./bin/solr start </code></li></ul>



<p>Because the search index is large, memory issues might occur: by default, the JVM is started with very little memory. In this case, increasing the memory might help (<code>PATH &gt; ./bin/solr restart -m 12g</code>). Run <code>PATH &gt; ./bin/solr restart --help</code> to see further start options. The Solr dashboard at <a href="http://localhost:8983/solr/#/">http://localhost:8983/solr/#/</a> shows memory consumption and other properties at run time.</p>



<p><strong>Step 5:</strong> Open your web browser and connect to the web console:</p>



<ul><li style="text-align:justify;"><code>localhost:8983/solr</code>.</li></ul>



<p><strong>Step 6:</strong> If the core is not yet present, click on <em>Add Core. </em>You can name this core whatever you want, but you will need to change the &#8220;instanceDir&#8221; field to the same directory name you used in Step 2. (The core does not need to have the same name as the directory, but it can.) Leave the other three fields unchanged. If you stored the directory as suggested in Step 3, it should be already available. Otherwise, you can use the add core functionality to store the core at any other place. Note that depending on user rights, access to the core might cause conflicts or security risks.</p>



<p><strong>Step 7:</strong> Populate the empty core with the data you want to work with. Solr allows you to directly upload a CSV (or any properly formatted file) to the core.</p>



<ol><li>Some files &#8212; for example, the metadata file for the YFCC100M dataset &#8212; need to be reformatted before they can be uploaded. To reformat the YFCC100M metadata file, you can use <a href="http://s3-us-west-2.amazonaws.com/multimedia-commons/tools/etc/Solr/reformat.py" target="_blank" rel="noopener noreferrer">this custom Python script</a>. It is a good idea to do a test run with a small subset of the metadata file first (e.g., the first 10,000 lines of the YFCC100M metadata file). Alternatively, you can download <a href="https://drive.google.com/file/d/0B6D7jCorgVvCaTRNQmRBQ0UzcUk/view?usp=sharing">a reformatted table where some metadata has already been added</a> (17GB, <a href="https://rclone.org/drive/">rclone</a> probably required).</li><li>Update the core with the properly formatted file. For the YFCC100M metadata, you would use this command:
<pre>curl 'http://localhost:8983/solr/<strong>[CORE NAME]</strong>/update/csv?commit=true&amp;separator=%09&amp;header=true&amp;stream.file=<strong>[PATH TO REFORMATTED METADATA FILE]</strong>&amp;f.usertags.split=true&amp;f.usertags.separator=%2C&amp;f.machinetags.split=true&amp;f.machinetags.separator=%2C&amp;overwrite=true'</pre>
<p>This may take an hour or longer, depending on the power of your CPU. If you want to learn more about what each of the parameters does, you can check out the Solr Wiki&#8217;s instructions for <a href="https://wiki.apache.org/solr/UpdateCSV" target="_blank" rel="noopener noreferrer">Updating a Solr Index with CSV</a>. Unfortunately, uploading a CSV works only for adding new documents, not for modifying existing ones, for example to add data from extensions like autotags.</p>
</li></ol>
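


<p>For the test run suggested in Step 7, a small subset can be cut from the metadata file with <code>head</code>. This is a minimal sketch; the function name and the file names in the example are placeholders, not the actual names of our files.</p>



<pre>
# Copy the first N lines (default 10000) of a large metadata file
# into a smaller file for a trial run.
make_metadata_subset() {
  head -n "${3:-10000}" "$1" &gt; "$2"
}
</pre>



<p>For example, <code>make_metadata_subset yfcc100m_dataset yfcc100m_subset</code> writes the first 10,000 lines to <code>yfcc100m_subset</code>, which you can then reformat and upload before processing the full file.</p>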



<p><strong>Step 8:</strong> Select the new core from the Core Selector dropdown menu.</p>



<p><strong>Step 9:</strong> Select &#8220;Query&#8221;. (To retrieve the first ten documents, you can just click <em>Execute Query</em>.)</p>



<p>The results of your query will be shown in the right-hand panel.</p>



<p>(Note that you can also use the URL bar to formulate a query. The URL syntax you could have used will appear above your query results.)</p>
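


<p>The same kind of query can be sent from the command line. The sketch below uses <code>curl</code> against Solr&#8217;s <code>select</code> handler and lets curl URL-encode the parameters. The function name <code>solr_select</code> is a hypothetical helper, and the core name <code>yfcc100m</code> in the example assumes you named your core after the directory suggested earlier.</p>



<pre>
# Query a core through the select handler; curl URL-encodes the parameters.
# Arguments: core name, query string, number of rows (default 10).
solr_select() {
  curl --get "http://localhost:8983/solr/$1/select" \
    --data-urlencode "q=$2" \
    --data-urlencode "rows=${3:-10}"
}
</pre>



<p>For example, <code>solr_select yfcc100m '*:*'</code> retrieves the first ten documents. A field query such as <code>solr_select yfcc100m 'usertags:beach' 5</code> should also work if your schema indexes the <code>usertags</code> field.</p>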



<p><strong>Step 10:</strong> Add the metadata from extensions like places or autotags.</p>
]]></html><thumbnail_url><![CDATA[https://multimediacommons.files.wordpress.com/2015/12/flickr_sample_20.png?w=250&fit=440%2C330]]></thumbnail_url><thumbnail_width><![CDATA[250]]></thumbnail_width><thumbnail_height><![CDATA[167]]></thumbnail_height></oembed>