Kent Tong's personal thoughts on information technology: Concepts of Squid

Basic proxy operation

In its normal mode of operation, Squid serves as a proxy for internal users accessing the Internet. The user's browser is configured to use Squid as the proxy. Then, when the user tries to access a URL, the browser sends the request to Squid instead of to the server named in the URL (the "origin server").
The request sent to Squid (a "proxy request") differs slightly from a normal HTTP request: the former contains the full URL in the request line (e.g., in the GET command), while the latter contains only a path. This allows the proxy to contact the origin server as needed.
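To make the difference concrete, here is a minimal Python sketch that builds the request line both ways. The helper name `request_line` is made up for illustration; it is not part of Squid or of any library.

```python
from urllib.parse import urlsplit

def request_line(url: str, via_proxy: bool, method: str = "GET") -> str:
    """Return the first line of an HTTP/1.1 request for `url`."""
    if via_proxy:
        # Proxy request: the full URL goes into the request line.
        target = url
    else:
        # Normal request: only the path (plus any query string) is sent;
        # the host name travels separately in the Host header.
        parts = urlsplit(url)
        target = parts.path or "/"
        if parts.query:
            target += "?" + parts.query
    return f"{method} {target} HTTP/1.1"
```

Calling `request_line("http://www.foo.com/index.html", True)` gives the proxy form with the full URL, while passing `False` gives the form with only `/index.html`.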
When Squid receives a proxy request, by default it tries to find the resource (e.g., the web page) in the following order:

Its cache.
The siblings and parents. Siblings should be deployed on the same site for load-balancing. Parents should be deployed at a larger ISP or at the regional headquarters. The idea is that the on-site proxies will in time contain mostly resources commonly accessed by the users on that site, while the ISP proxy will contain mostly resources commonly needed by the users in that region.

Squid will query all its siblings and parents (collectively called "peers"). If there is a HIT response, it will send the request to the first peer that returned one. If there is none (only MISS responses or no response), the first parent returning MISS will get the request. That parent is expected to get the resource from the origin server or from its own parent if the resource is not in its cache. If there is no reply from the parents at all (e.g., all are busy or down), Squid will use the parent that is marked as the default.
The chain of request handling should normally stop here, unless, say, all the parents are down.

The origin server. The server hosting the resource (identified in the full URL in the GET command).

After getting the resource, it will save it into its cache to speed up future accesses. The path to the cache and maximum disk space can be configured.
Squid uses the ICP protocol to query its siblings and parents (the peers) to check if they have the resource in their caches. If yes, it will use HTTP to retrieve it. ICP is usually based on UDP and uses UDP port 3130. Squid usually listens for HTTP requests on TCP port 3128.
A sibling or parent (peer) is defined like this:

cache_peer <hostname1> parent 3128 3130
cache_peer <hostname2> sibling 3128 3130

The <hostname> above is resolved to find the peer's IP. The two numbers are the peer's HTTP port and its ICP port.
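
The peer selection described above can be sketched in Python as follows. This is only a model of the prose description, not Squid's actual code; the `IcpReply` type and the `choose_peer` function are hypothetical names, and the sketch assumes replies are listed in the order they arrived.

```python
from typing import NamedTuple, Optional, Sequence

class IcpReply(NamedTuple):
    peer: str   # host name of the peer that answered
    kind: str   # "parent" or "sibling"
    hit: bool   # True for an ICP HIT, False for a MISS

def choose_peer(replies: Sequence[IcpReply],
                default_parent: Optional[str] = None) -> Optional[str]:
    """Pick the peer to forward the request to.

    `replies` holds only the peers that answered, in arrival order.
    A return value of None means no peer is usable, so the request
    would go to the origin server.
    """
    # 1. Any peer (sibling or parent) that reported a HIT wins.
    for reply in replies:
        if reply.hit:
            return reply.peer
    # 2. Otherwise use the first parent that answered (with a MISS).
    for reply in replies:
        if reply.kind == "parent":
            return reply.peer
    # 3. No parent replied at all: fall back to the default parent, if any.
    return default_parent
```

Note that a sibling answering MISS is never chosen: only a parent is expected to fetch a resource it does not already have.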

Access security

Usually you should only allow your internal users to use your Squid server (by restricting client IPs or requiring user authentication). To do that, you can configure ACLs so that only authorized internal users can connect to it (the http_access configuration).

http_access allow myACL1

An ACL is a named Boolean expression that can refer to various properties of the request (e.g., URL, domain name in the URL, HTTP method, source IP, authenticated user name):

acl myACL1 url_regex ^http://...
acl myACL2 dstdomain .foo.com

When an ACL is evaluated and it refers to information such as the authenticated user name that is not yet in the request, Squid will perform proxy authentication with the client first.
You can specify multiple ACLs on the same http_access line and they will be ANDed together:

http_access allow myACL1 myACL2
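
As an illustration of how such rules combine, here is a small Python model. It is a sketch under stated assumptions, not Squid's implementation: `dstdomain`, `src_prefix` and `http_access` are hypothetical helpers, `src_prefix` is a toy string-prefix check rather than real subnet matching, and the sketch simply denies the request when no line matches.

```python
from typing import Callable, Dict, List, Tuple

Request = Dict[str, str]            # e.g. {"src": "192.168.1.7", "host": "www.foo.com"}
Acl = Callable[[Request], bool]     # an ACL is a named Boolean test on the request

def dstdomain(pattern: str) -> Acl:
    """Match the host in the URL; a leading dot also matches subdomains."""
    def match(req: Request) -> bool:
        host = req["host"].lower()
        if pattern.startswith("."):
            return host == pattern[1:] or host.endswith(pattern)
        return host == pattern
    return match

def src_prefix(prefix: str) -> Acl:
    """Toy source-address test: plain string prefix, not CIDR matching."""
    return lambda req: req["src"].startswith(prefix)

def http_access(rules: List[Tuple[str, List[Acl]]], req: Request) -> str:
    """Return "allow" or "deny" for `req`, checking the lines in order."""
    for action, acls in rules:
        # Every ACL on one line must match (logical AND).
        if all(acl(req) for acl in acls):
            return action           # the first matching line decides
    return "deny"                   # simplification: deny if nothing matched
```

With the rules `[("allow", [src_prefix("192.168.1."), dstdomain(".foo.com")])]`, a request is allowed only if it comes from the internal prefix and also targets foo.com or one of its subdomains.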

Controlling routing of request

You can control the routing between peers by applying ACLs to the peers. When Squid is selecting the peers, it will filter out those whose ACLs aren't satisfied. For example, to force Squid to use a particular parent for accesses to www.foo.com:

acl myACL1 dstdomain www.foo.com
cache_peer mycache1.local parent 3128 3130
cache_peer mycache2.local parent 3128 3130
cache_peer_access mycache1.local allow myACL1
cache_peer_access mycache2.local deny myACL1
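
The filtering step can be modelled in Python like this. The function names are hypothetical, each rule is reduced to an exact host name comparison, and the sketch assumes the usual Squid access-list convention that when no line matches, the result is the opposite of the last line; treat that last point as an assumption to verify against your Squid version.

```python
from typing import Dict, List, Tuple

# Rules per peer, in configuration order: (action, destination host to match).
PeerRules = Dict[str, List[Tuple[str, str]]]

def peer_allowed(rules: List[Tuple[str, str]], host: str) -> bool:
    """Decide whether one peer may be used for a request to `host`."""
    if not rules:
        return True                      # no cache_peer_access lines: always usable
    for action, domain in rules:
        if host == domain:
            return action == "allow"     # the first matching line decides
    # Assumed convention: no match means the opposite of the last line.
    return rules[-1][0] != "allow"

def eligible_peers(peers: List[str], rules: PeerRules, host: str) -> List[str]:
    """Filter out the peers whose ACLs are not satisfied for `host`."""
    return [p for p in peers if peer_allowed(rules.get(p, []), host)]
```

Under these assumptions, the configuration above sends www.foo.com only to mycache1.local, and every other destination only to mycache2.local.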

SSL processing

For HTTPS, the client will usually send a CONNECT command to Squid to establish a TCP proxy connection to the origin server. Squid will simply forward the raw TCP data (encrypted HTTP commands) back and forth.
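For illustration, the CONNECT command a client sends before starting the SSL handshake looks roughly like the output of this Python sketch. The helper `connect_request` is a made-up name, and real browsers typically add more headers.

```python
def connect_request(host: str, port: int = 443) -> bytes:
    """Build a minimal CONNECT request asking the proxy for a raw TCP tunnel."""
    target = f"{host}:{port}"
    # Unlike GET, CONNECT names only a host and port, never a path.
    return (f"CONNECT {target} HTTP/1.1\r\n"
            f"Host: {target}\r\n"
            "\r\n").encode("ascii")
```

Once the proxy answers with a 200 status, everything else on the connection is opaque encrypted data as far as the proxy is concerned.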
Newer versions of Squid can instead intercept the connection by presenting a certificate of their own: either one faked on the fly to match the origin server (for which Squid needs a CA's cert and private key) or just a fixed cert that doesn't really match the origin server. In either case, such cert faking and interception (called "SSL bumping") is configured in the http_port configuration.

Reverse proxy

If you need Squid to serve as a reverse proxy so that Internet users can access your DMZ web servers, most likely you won't need http_access to restrict access. On the other hand, you can't simply leave out the siblings and parents in the hope that Squid will contact the origin server (using the server name in the request), because that name will resolve to the IP of Squid itself! So, the way to do it is to configure the origin server as a parent. However, Squid normally sends proxy requests to a parent, and some HTTP servers can only handle normal requests. So, you need to mark the parent as the origin server so that Squid sends it a normal request. You also need to tell Squid not to query it (e.g., with ICP) for any resource, because it is not really a cache:

cache_peer <hostname-of-origin> parent 80 0 originserver no-query

To serve as a reverse proxy, Squid must be prepared to receive normal requests instead of proxy requests. It must also find out the site name from the Host header to construct the URL. Both are enabled by the accel option of http_port. If the request contains no Host header, the defaultsite option provides a default site to use. Older versions of Squid rely on the defaultsite option by default (probably because HTTP 1.0, which didn't have the Host header, was more common at that time). To tell such a version to use the Host header, you need to add the vhost option:

http_port 80 accel defaultsite=www.foo.com
http_port 80 accel defaultsite=www.foo.com vhost
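
The URL reconstruction can be pictured with a short Python sketch. It only models the behaviour described in the text; `rebuild_url` and its parameters are hypothetical names, not Squid internals.

```python
from typing import Optional

def rebuild_url(path: str, host_header: Optional[str], defaultsite: str,
                vhost: bool = True, scheme: str = "http") -> str:
    """Turn a normal request (path only) back into a full URL."""
    # With vhost, prefer the Host header. Without vhost, or when the
    # header is missing, fall back to the configured default site.
    site = host_header if (vhost and host_header) else defaultsite
    return f"{scheme}://{site}{path}"
```

For example, a request for `/index.html` with `Host: shop.foo.com` maps to `http://shop.foo.com/index.html` when vhost is on, and to `http://www.foo.com/index.html` when only defaultsite=www.foo.com is used.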

Usually your public website is on port 80, so you need to configure Squid to listen on port 80:

http_port 80

If the clients are supposed to connect to Squid using SSL, you need to enable SSL processing and set the port using the https_port configuration. This is only meaningful in the reverse proxy mode of operation (in normal mode, Squid usually just serves as a TCP proxy for SSL traffic).

https_port 443 cert=<path to cert file> key=<path to key file>

You should allow the public to access Squid:

http_access allow all

Transparent proxy

Transparent proxy means that the client believes it is accessing the HTTP server directly, but in fact it is talking to the proxy. One way to do this is to let the router perform DNAT so that packets to port 80 are sent to Squid. Just as with a reverse proxy, Squid needs to accept normal HTTP requests instead of proxy requests. Because the router's DNAT also translates the reply packets back, the client won't notice any problem.
To tell Squid to run in transparent mode:

http_port 80 transparent

Why not just use the accel mode? There are some differences between them:

In transparent mode, Squid will not perform any authentication, because if both it and the real server required authentication, the client would be confused.
In transparent mode, if there is no Host header in the HTTP request, Squid will try to read the NAT connection state (which only works if it is running on the router) to find out the original destination IP and port, and use those to construct the URL.

If Squid is running on the router, just use a simplified form of DNAT:

iptables -t nat -A PREROUTING -s <client subnet> -p tcp --dport 80 -j REDIRECT --to-ports 80

If Squid is running on another host, use full DNAT:

iptables -t nat -A PREROUTING -s <client subnet> -p tcp --dport 80 -j DNAT --to-destination <ip of squid>

Sunday, March 17, 2013
