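How to Expand bit.ly and Other Shortened URLs in Java

I needed to expand shortened URLs, so I implemented a fast way to do it.

There are reportedly more than 90 URL shortening services, so calling each service's own API would never scale. The code below has been tested with bit.ly, t.co, goo.gl and others, and I expect it to work with any shortening service that redirects via the Location header.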

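import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.ProtocolException;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;

public abstract class UrlUtility {
	// Expands a shortened URL by reading the Location header of a HEAD response.
	public static URL expandUrl(URL aUrl) throws IOException, ProtocolException {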
		final URLConnection tURLConnection = aUrl.openConnection(Proxy.NO_PROXY);
		if (!(tURLConnection instanceof HttpURLConnection)) {
			return aUrl;
		}
		final HttpURLConnection tHttpURLConnection = (HttpURLConnection) tURLConnection;
		tHttpURLConnection.setRequestMethod("HEAD");
		tHttpURLConnection.setInstanceFollowRedirects(false);
		tHttpURLConnection.connect();
 
		final String tExpandedUrl;
		final String tLocation = tHttpURLConnection.getHeaderField("Location");
		if (tLocation != null && tLocation.startsWith("http")) {
			final int tResponseCode = tHttpURLConnection.getResponseCode();
			if (tResponseCode == HttpURLConnection.HTTP_MOVED_PERM || tResponseCode == HttpURLConnection.HTTP_MOVED_TEMP) {
				// Follow the next hop; expandUrl already returns a URL.
				return expandUrl(new URL(encode(tLocation)));
			}
			tExpandedUrl = tLocation;
		} else {
			tExpandedUrl = tHttpURLConnection.getURL().toExternalForm();
		}
 
		return new URL(encode(tExpandedUrl));
	}
 
	// @formatter:off
	private static final String[] HEX = {
		"80","81","82","83","84","85","86","87","88","89","8A","8B","8C","8D","8E","8F",
		"90","91","92","93","94","95","96","97","98","99","9A","9B","9C","9D","9E","9F",
		"A0","A1","A2","A3","A4","A5","A6","A7","A8","A9","AA","AB","AC","AD","AE","AF",
		"B0","B1","B2","B3","B4","B5","B6","B7","B8","B9","BA","BB","BC","BD","BE","BF",
		"C0","C1","C2","C3","C4","C5","C6","C7","C8","C9","CA","CB","CC","CD","CE","CF",
		"D0","D1","D2","D3","D4","D5","D6","D7","D8","D9","DA","DB","DC","DD","DE","DF",
		"E0","E1","E2","E3","E4","E5","E6","E7","E8","E9","EA","EB","EC","ED","EE","EF",
		"F0","F1","F2","F3","F4","F5","F6","F7","F8","F9","FA","FB","FC","FD","FE","FF",
		"00","01","02","03","04","05","06","07","08","09","0A","0B","0C","0D","0E","0F",
		"10","11","12","13","14","15","16","17","18","19","1A","1B","1C","1D","1E","1F",
		"20","21","22","23","24","25","26","27","28","29","2A","2B","2C","2D","2E","2F",
		"30","31","32","33","34","35","36","37","38","39","3A","3B","3C","3D","3E","3F",
		"40","41","42","43","44","45","46","47","48","49","4A","4B","4C","4D","4E","4F",
		"50","51","52","53","54","55","56","57","58","59","5A","5B","5C","5D","5E","5F",
		"60","61","62","63","64","65","66","67","68","69","6A","6B","6C","6D","6E","6F",
		"70","71","72","73","74","75","76","77","78","79","7A","7B","7C","7D","7E","7F",
	};
	// @formatter:on
 
	private static String encode(String aUrl) throws UnsupportedEncodingException {
		final byte[] tBytes = aUrl.getBytes("ISO-8859-1");
		final int tLength = tBytes.length;
		final StringBuilder tBuilder = new StringBuilder(tLength * 3);
		for (int tIndex = 0; tIndex < tLength; tIndex++) {
			final int tIntAt = (int) tBytes[tIndex];
			if (tIntAt < 0) {
				tBuilder.append('%');
				tBuilder.append(HEX[tIntAt + 128]);
			} else {
				tBuilder.append((char) tIntAt);
			}
		}
		return tBuilder.toString();
	}
}
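
The `encode` helper can also be written without the lookup table. The following is a minimal sketch, not the author's code. It assumes, as the original appears to, that the non-ASCII bytes of the Location header reach Java as ISO-8859-1 characters, so converting back to ISO-8859-1 recovers the raw bytes the server sent. The class name `PercentEncodeSketch` is hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class PercentEncodeSketch {
    private static final char[] DIGITS = "0123456789ABCDEF".toCharArray();

    /** Percent-encodes every byte in the range 0x80-0xFF and leaves ASCII bytes untouched. */
    static String encode(String url) {
        // ISO-8859-1 maps chars 0x00-0xFF one-to-one onto bytes.
        final byte[] bytes = url.getBytes(StandardCharsets.ISO_8859_1);
        final StringBuilder builder = new StringBuilder(bytes.length * 3);
        for (final byte b : bytes) {
            if (b < 0) { // Java bytes are signed, so 0x80-0xFF show up as negative values
                builder.append('%')
                       .append(DIGITS[(b >> 4) & 0x0F])
                       .append(DIGITS[b & 0x0F]);
            } else {
                builder.append((char) b);
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // Simulate a header value whose UTF-8 bytes were decoded as ISO-8859-1 (an assumption).
        final String raw = new String(
                "http://example.com/\u7E2E".getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
        System.out.println(encode(raw));
    }
}
```

Because only bytes above 0x7F are rewritten, existing `%XX` escapes and reserved characters such as `?` and `&` pass through unchanged, which matches the behavior of the table-based version.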

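Considerations

I had to take the following into account.

  • The expanded URL may contain multibyte characters
    → Handled by percent-encoding the URL
  • When the host of the expanded URL runs IIS, the path is dropped from the result of HttpURLConnection#getURL()
    → Handled by calling tHttpURLConnection.setInstanceFollowRedirects(false) and reading the Location header, so the IIS server is never accessed
  • Slowness
    → Handled by passing Proxy.NO_PROXY to URL#openConnection
    → Handled by using the HEAD method so that the response body is skipped
  • It is hard to tell which URLs do not need expanding
    → Handled by attempting the expansion anyway and using either the Location header or HttpURLConnection#getURL()
  • Redirects may be chained over several hops
    → Handled by calling expandUrl recursively for as long as the HTTP status code is 301 or 302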
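
The recursive approach in the last item places no upper bound on the number of hops, so two URLs that redirect to each other would recurse until the stack overflows. Below is a minimal sketch of the same idea written as a loop with a hop limit. It is an illustration under assumptions, not the author's code: `RedirectChainSketch`, `resolve` and `MAX_HOPS` are hypothetical names, the limit of 10 is arbitrary, and the `Function` argument stands in for "send a HEAD request and return the Location header if the status is 301 or 302, otherwise null".

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class RedirectChainSketch {
    static final int MAX_HOPS = 10; // arbitrary limit chosen for this sketch

    /**
     * Follows redirects until nextLocation returns no usable target
     * or MAX_HOPS redirects have been followed.
     */
    static String resolve(String startUrl, Function<String, String> nextLocation) {
        String current = startUrl;
        for (int hop = 0; hop < MAX_HOPS; hop++) {
            final String next = nextLocation.apply(current);
            if (next == null || !next.startsWith("http")) {
                return current; // no further redirect: this is the expanded URL
            }
            current = next;
        }
        return current; // hop limit reached: return the last URL seen
    }

    public static void main(String[] args) {
        // An in-memory redirect table replaces real HTTP requests.
        final Map<String, String> redirects = new HashMap<>();
        redirects.put("http://a.example/1", "http://b.example/2");
        redirects.put("http://b.example/2", "http://c.example/final");
        System.out.println(resolve("http://a.example/1", redirects::get));
    }
}
```

Keeping the HTTP call behind a function makes the loop easy to exercise without network access; in real use the function would wrap the HEAD request shown in expandUrl.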

Reference URLs

  • List of URL shortening services
    Far too many... orz
    %E7%B8%AE.jp, 1bps.biz, 3.ly, 30m.in, 5wd60.tk, a.r10.to, a5dyo.tk, a8.net, abf.to, ads.modiphi.com, am6.jp, amzn.to, artbeat.ly, bctiny.com, bit.ly, bkite.com, bt.io, dlvr.it, dw.am, fc2.in, feedburner.com, feedburner.jp, feedproxy.google.com, feeds.digitaldj.jp, flic.kr, flickr.com/p/, fnn-news.com, fon.gsyep.it, gigaz.in, go.2ch2.net, goo.gl, ht.ly, htn.to, icio.us, idek.net, instapaper.com, is.gd, j.mp, jpan.jp, labo.tv, liten.be, mfi.rep.tl, mjk.ac, moby.to, moi.st, mpr.hn, nhk.jp, oneclip.jp, ow.ly, pheedo.jp, pheedo.wiredvision.jp, pic.gd, post.ly, pr.cm, qurl.com, r.sm3.jp, rd.yahoo.co.jp, rss.rssad.jp, snipurl.com, snurl.com, sugowaza.jp/r, t.co, terra.es, tinyurl.com, tl.gdff.im, to.ly, tr.im, tumblr.com, tuna.be, twt.mx, twurl.nl, u.nu, url4.eu, urlbrief.com, urltea.com, urltea.me, ustre.am, vdh.bz, vriend.jp, xfs.jp, xrl.us, yfrog.com, youtu.be, z.la, 縮.jp
  • Expanding shortened URLs
  • The Trick To Write A Fast (Universal) Java URL Expander
    Now everyone wants to shorten URLs. Here is a list of 90 + URL shortening services (!!) without counting the ones that you can build by yourself.
    How we (developers) can survive in this jungle if we want to retrieve the real expended version of those tons of URLs?

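Reader responses (7)

  1. Ken says:

    I have looked through the source code.
    I think it is very useful.

    I have one question.
    When a URL is redirected more than once, for example when a shortened URL has itself been shortened,
    how should it be expanded?

    Is calling expandUrl()
    several times the only option?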

  2. squld says:

    Changing tHttpURLConnection.setInstanceFollowRedirects(false); in the code to tHttpURLConnection.setInstanceFollowRedirects(true); makes it handle multi-hop redirects.
    However, with setInstanceFollowRedirects(true) there is a confirmed problem: when the server that the expanded URL points to runs IIS and the expanded URL contains multibyte characters, everything after the context path of the URL is lost.

    Concretely, expanding http://bit.ly/icUhTi should yield http://nishinomiya-style.com/upload/blog/post/ランチタイムメニュー.JPG but yields http://nishinomiya-style.com/ instead.
    (I borrowed content from the Nishinomiya Style site, http://nishinomiya-style.com/, as the example.)

    This is quite a rare case, so it may be acceptable to ignore it.
    I will test a little more to see whether there is a workaround.

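  3. squld says:

    I handled this by calling expandUrl recursively for as long as the HTTP status code is 301 or 302.

  4. k says:

    It can expand https URLs too, which is a great help.

  5. k says:

    When I tried to expand the Amazon URL below, which should already be fully expanded, a 302 came back and the URL became "http://www.amazon.co.jp". What should I do?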
    http://www.amazon.co.jp/gp/search?ie=UTF8&keywords=%E6%9D%BE%E4%B8%8B%E5%B9%B8%E4%B9%8B%E5%8A%A9&tag=starofhitman-22&index=books&linkCode=ur2&camp=247&creative=1211

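  6. squld says:

    It appears that accessing that URL with the HEAD method returns a 302.
    GET returns a 200, so switching to GET should fix it.
    Concretely, deleting the following line fixes it.

    tHttpURLConnection.setRequestMethod("HEAD");

    However, because GET tries to fetch the entire content, processing may become somewhat slower.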
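
One possible middle ground is to keep HEAD as the first request and confirm with GET only when HEAD does not return 200. The following is a minimal sketch under assumptions, not a tested fix for the case above: `HeadGetFallbackSketch`, `Probe`, `Prober` and `nextLocation` are hypothetical names, and the policy of trusting GET over HEAD is a design choice of this sketch. It costs one extra request per redirect hop, so it only pays off when most inputs are already expanded.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URL;

final class HeadGetFallbackSketch {

    /** Minimal view of a response: status code and Location header (may be null). */
    static final class Probe {
        final int status;
        final String location;

        Probe(int status, String location) {
            this.status = status;
            this.location = location;
        }
    }

    /** Hypothetical abstraction over "send one request without following redirects". */
    interface Prober {
        Probe probe(String url, String method);
    }

    static boolean isRedirect(int status) {
        return status == HttpURLConnection.HTTP_MOVED_PERM
                || status == HttpURLConnection.HTTP_MOVED_TEMP;
    }

    /** Returns the redirect target of url, or null when url should be kept as-is. */
    static String nextLocation(String url, Prober prober) {
        final Probe head = prober.probe(url, "HEAD");
        if (head.status == HttpURLConnection.HTTP_OK) {
            return null; // fast path: HEAD already says this URL is final
        }
        // Some servers answer HEAD and GET differently, so ask again with GET and trust that answer.
        final Probe get = prober.probe(url, "GET");
        return isRedirect(get.status) ? get.location : null;
    }

    /** Real prober: reads only the status line and headers, then disconnects. */
    static Probe httpProbe(String url, String method) {
        try {
            final HttpURLConnection connection =
                    (HttpURLConnection) new URL(url).openConnection(Proxy.NO_PROXY);
            try {
                connection.setRequestMethod(method);
                connection.setInstanceFollowRedirects(false);
                return new Probe(connection.getResponseCode(), connection.getHeaderField("Location"));
            } finally {
                connection.disconnect(); // the response body is never read
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Fake server that mimics the reported behavior: 302 for HEAD, 200 for GET.
        final Prober fake = (u, m) ->
                "HEAD".equals(m) ? new Probe(302, "http://www.amazon.co.jp") : new Probe(200, null);
        System.out.println(nextLocation("http://www.amazon.co.jp/gp/search", fake)); // prints "null"
    }
}
```

In real use the prober would be the method reference `HeadGetFallbackSketch::httpProbe`; the fake prober in `main` only keeps the example runnable without network access.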

    A similar problem also seems to occur when URLs on Amazon S3 are accessed with the HEAD method.

    GET returns 200 OK:

    $ curl -v 'http://s3.amazonaws.com/twitpic/photos/full/370660253.jpg?AWSAccessKeyId=AKIAJF3XCCKACR3QDMOA&Expires=1313654473&Signature=PR0pe%2FxrgbKWfriRhn2h11Sl5Es%3D' > /dev/null
    * About to connect() to s3.amazonaws.com port 80
    *   Trying 207.171.189.80... connected
    * Connected to s3.amazonaws.com (207.171.189.80) port 80
    > GET /twitpic/photos/full/370660253.jpg?AWSAccessKeyId=AKIAJF3XCCKACR3QDMOA&Expires=1313654473&Signature=PR0pe%2FxrgbKWfriRhn2h11Sl5Es%3D HTTP/1.1
    > User-Agent: curl/7.15.5 (i686-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
    > Host: s3.amazonaws.com
    > Accept: */*
    >
    < HTTP/1.1 200 OK
    

    whereas HEAD returns 403 Forbidden:

    $ curl -X HEAD -v 'http://s3.amazonaws.com/twitpic/photos/full/370660253.jpg?AWSAccessKeyId=AKIAJF3XCCKACR3QDMOA&Expires=1313654473&Signature=PR0pe%2FxrgbKWfriRhn2h11Sl5Es%3D' > /dev/null
    * About to connect() to s3.amazonaws.com port 80
    *   Trying 207.171.185.200... connected
    * Connected to s3.amazonaws.com (207.171.185.200) port 80
    > HEAD /twitpic/photos/full/370660253.jpg?AWSAccessKeyId=AKIAJF3XCCKACR3QDMOA&Expires=1313654473&Signature=PR0pe%2FxrgbKWfriRhn2h11Sl5Es%3D HTTP/1.1
    > User-Agent: curl/7.15.5 (i686-redhat-linux-gnu) libcurl/7.15.5 OpenSSL/0.9.8b zlib/1.2.3 libidn/0.6.5
    > Host: s3.amazonaws.com
    > Accept: */*
    >
    < HTTP/1.1 403 Forbidden
    
