Hadoop FileSystem support
Hadoop FS Support is an implementation of the Apache Hadoop FileSystem interface. It lets you work with the platform and other industry-standard big data tools without writing much custom code to do so. A few examples include:
- Read from and write to the Object store layer using Hadoop FS support in Spark (see the tutorial)
- Bring your data into HERE platform using an object store layer (see the tutorial)
- Connect an object store layer to open-source processing and analytics tools outside the platform, such as Apache Spark, Apache Drill, Presto, AWS EMR, and others.
Layer support
Hadoop FS Support is only available for the objectstore layer type.
Hadoop FileSystem interface support
The following Hadoop Filesystem methods are supported:
- `String getScheme()` returns the scheme, which is `blobfs`.
- `URI getUri()` returns the URI, for example `blobfs://hrn:here:data::olp-here:blobfs-test:test-data`.
- `void initialize(URI name, Configuration conf)` initializes the `BlobFs` FileSystem.
- `FSDataInputStream open(Path f, int bufferSize)` provides an `InputStream` to read data from an object stored in the Object Store layer.
- `FileStatus getFileStatus(Path f)` provides information about a file.
- `FileStatus[] listStatus(Path f)` provides a list of files along with their respective information.
- `FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite, int bufferSize, short replication, long blockSize, Progressable progress)` provides an `OutputStream` to write data to. The parameters `progress`, `permission` and `replication` are not implemented.
- `boolean rename(Path src, Path dst)` renames the path.
- `boolean delete(Path f, boolean recursive)` deletes a path. If the `recursive` flag is set to `true`, all sub-directories and files are also deleted.
- `boolean mkdirs(Path f, FsPermission permission)` creates directories for the given path. The parameter `permission` is not implemented.
- `void close()` closes the file system.
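The scheme and URI layout described above can be illustrated with only the JDK: the catalog HRN plus layer ID form the authority component of the URI that `getUri()` returns and that `initialize()` receives. A minimal sketch (the path `/some/object` is illustrative, not part of the interface):

```scala
import java.net.URI

// The HRN and layer ID together form the authority of a BlobFs URI.
val catalogHrn = "hrn:here:data::olp-here:blobfs-test"
val layerId    = "test-data"
val uri        = new URI(s"blobfs://$catalogHrn:$layerId/some/object")

println(uri.getScheme)    // blobfs
println(uri.getAuthority) // hrn:here:data::olp-here:blobfs-test:test-data
println(uri.getPath)      // /some/object
```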
Usage
For the catalog HRN hrn:here:data::olp-here:blobfs-test and the layer ID
test-data, the URL to be used in Hadoop/Spark/Drill is
blobfs://hrn:here:data::olp-here:blobfs-test:test-data.
Spark
Your Spark application needs a dependency on the hadoop-fs-support
package.
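As a sketch, the sbt dependency might look as follows. The organization shown is an assumption inferred from the class names on this page (`com.here.platform.data.client.hdfs`), and the artifact name matches the assembly jar referenced later on this page; check the coordinates for your SDK version:

```scala
// build.sbt sketch -- organization and version placeholder are assumptions,
// not confirmed coordinates; verify against your platform SDK documentation.
libraryDependencies += "com.here.platform.data.client" %% "hadoop-fs-support" % "<VERSION>"
```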
val catalogHrn = "hrn:here:data::olp-here:blobfs-test"
val layerId = "test-data"
val sourcePath = s"blobfs://$catalogHrn:$layerId/source"
val destinationPath = s"blobfs://$catalogHrn:$layerId/destination"
val sourceRdd = sparkContext.textFile(sourcePath)
sourceRdd.saveAsTextFile(destinationPath)

Hadoop fs shell
You can use the Hadoop File System Shell to explore the contents of the Object Store layer. The Hadoop File System Shell supports the following operations:
- cat
- copyFromLocal
- copyToLocal
- count
- cp
- ls
- mkdir
- moveFromLocal
- moveToLocal
- mv
- put
- rm
- rmdir
- rmr
- test
- text
- touchz
- truncate
- usage
export HADOOP_CLASSPATH="hadoop-fs-support_2.13-${VERSION}-assembly.jar"
hadoop fs -mkdir blobfs://hrn:here:data::olp-here:blobfs-test:test-data/directory1
hadoop fs -cp file.txt blobfs://hrn:here:data::olp-here:blobfs-test:test-data/directory1
hadoop fs -ls blobfs://hrn:here:data::olp-here:blobfs-test:test-data/directory1

Hadoop configurations
BlobFs supports the following Hadoop configurations:
- `fs.blobfs.multipart.part-upload-parallelism`: BlobFs uploads an object by splitting it into parts and using the multi-part upload functionality of Object Store. This configuration defines how many parts of a single object can be uploaded simultaneously. The minimum allowed parallelism is 1. The default value is 2. Upload speed can increase with higher parallelism, but this is more costly because each uploaded part is buffered in memory.
- `fs.blobfs.multipart.part-size`: the size, in bytes, of each uploaded part of the object. The minimum part size allowed is 5242880. The maximum part size allowed is 100663296, which is also the default value.
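For example, these settings can be tuned in core-site.xml; the values shown below are simply the documented defaults:

```xml
<!-- core-site.xml sketch: BlobFs multi-part upload tuning.
     The values below are the documented defaults. -->
<property>
  <name>fs.blobfs.multipart.part-upload-parallelism</name>
  <value>2</value>
</property>
<property>
  <name>fs.blobfs.multipart.part-size</name>
  <value>100663296</value>
</property>
```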
Authentication
For instructions on how to set up HERE credentials, see Get Your Credentials.
HERE credentials can also be passed as Hadoop configuration.
- `fs.blobfs.accessKeyId`: HERE access key ID. This configuration is the same as `here.access.key.id` in `credentials.properties`.
- `fs.blobfs.accessClientId`: HERE client ID. This configuration is the same as `here.client.id` in `credentials.properties`.
- `fs.blobfs.accessKeySecret`: HERE access key secret. This configuration is the same as `here.access.key.secret` in `credentials.properties`.
- `fs.blobfs.accessEndpointUrl`: HERE token endpoint URL. This configuration is the same as `here.token.endpoint.url` in `credentials.properties`.
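As a sketch, these credentials could be supplied in core-site.xml as follows; the values are placeholders, so copy the real ones from your `credentials.properties`:

```xml
<!-- core-site.xml sketch: HERE credentials as Hadoop configuration.
     All values are placeholders -- take the real ones from
     credentials.properties. -->
<property>
  <name>fs.blobfs.accessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.blobfs.accessClientId</name>
  <value>YOUR_CLIENT_ID</value>
</property>
<property>
  <name>fs.blobfs.accessKeySecret</name>
  <value>YOUR_ACCESS_KEY_SECRET</value>
</property>
<property>
  <name>fs.blobfs.accessEndpointUrl</name>
  <value>YOUR_TOKEN_ENDPOINT_URL</value>
</property>
```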
Configuring Hadoop installations
EMR
To run on EMR, you need to create the EMR cluster with an
additional --configurations parameter, for example:
aws emr create-cluster \
--name "$cluster_name" \
--release-label "emr-5.17.0" \
--applications Name="Spark" \
--region ${region} \
--log-uri ${log_location} \
--instance-type "m4.large" \
--instance-count 4 \
--service-role "EMR_DefaultRole" \
--ec2-attributes KeyName="some-key",InstanceProfile="EMR_EC2_DefaultRole",AdditionalMasterSecurityGroups="sg-xxxx",AdditionalSlaveSecurityGroups="sg-xxxxx",SubnetId="subnet-xxxx" \
--configurations file://emr-config.json

The content of the emr-config.json file is the following:
[
{
"Classification": "core-site",
"Properties": {
"fs.blobfs.impl": "com.here.platform.data.client.hdfs.DataServiceBlobHadoopFileSystem"
}
}
]

Standalone Hadoop installations
- The BlobFS fat jar needs to be included on the Hadoop classpath.
- In some cases, the following property needs to be added to the `core-site.xml` file:
<property>
<name>fs.blobfs.impl</name>
<value>com.here.platform.data.client.hdfs.DataServiceBlobHadoopFileSystem</value>
</property>

Drill
The BlobFS fat jar needs to be included on the Hadoop classpath.
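Once the jar is on the classpath, one way to query a layer from Drill is through a file-system storage plugin whose connection points at the layer URI. The sketch below assumes `fs.blobfs.impl` is registered as shown in the previous section; the workspace and format entries are illustrative, not prescribed by this product:

```json
{
  "type": "file",
  "connection": "blobfs://hrn:here:data::olp-here:blobfs-test:test-data",
  "workspaces": {
    "root": { "location": "/", "writable": false, "defaultInputFormat": null }
  },
  "formats": {
    "csv": { "type": "text", "extensions": ["csv"], "delimiter": "," }
  }
}
```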
Notes

- Object Store is not a true file system. It is a distributed key-value store and, as such, does not behave exactly like a file system. A file system expects operations such as `delete` or `rename` to be atomic; for Object Store, these operations finish eventually. A file system expects that, while a file is being read from or written to, its content is not changed and the file is not deleted; Object Store does not assure this behavior. You need to take precautions against these behavioral differences on your own.
- Hadoop FS Support implements Hadoop FileSystem version 2.7.3. It is highly likely that BlobFs will work with Hadoop versions up to 2.9.x, but this is not guaranteed. Hadoop version 3.x.x is not yet supported.
- Compatibility issues with HRN: BlobFs requires the HRN as the authority of the URI. Some tools create incorrect URIs for HRNs. To work around such cases, you can also pass the hex value of the HRN as the authority of the URI. For the catalog HRN `hrn:here:data::olp-here:blobfs-test` and the layer ID `test-data`, this looks as follows:

val hrnHex = Hex.encodeHexString("hrn:here:data::olp-here:blobfs-test:test-data".getBytes)
val blobFsUri = s"blobfs://$hrnHex"

- Hadoop FS Support does not support Apache Flink.